We consider median regression and, more generally, a possibly infinite collection of
quantile regressions in high-dimensional sparse models. In these models the number of
regressors $p$ is very large, possibly larger than the sample size $n$, but only at most $s$
regressors have a non-zero impact on each conditional quantile of the response variable,
where $s$ grows slower than $n$. Since ordinary quantile regression is not consistent in this
case, we consider $\ell_1$-penalized quantile regression ($\ell_1$-QR), which penalizes the
$\ell_1$-norm of regression coefficients, as well as the post-penalized QR estimator
(post-$\ell_1$-QR), which applies ordinary QR to the model selected by $\ell_1$-QR. First, we
show that for the leading designs $\ell_1$-QR is consistent at the near-oracle rate
$\sqrt{s/n}\sqrt{\log(p \vee n)}$, uniformly in the compact set $\mathcal{U} \subset (0,1)$ of
quantile indices. In deriving this result, we propose a partly pivotal, data-driven choice of
the penalty level and show that it satisfies the requirements for achieving this rate. Second,
we show that for the leading designs post-$\ell_1$-QR is consistent at the near-oracle rate
$\sqrt{s/n}\sqrt{\log(p \vee n)}$, uniformly over $\mathcal{U}$, even if the
$\ell_1$-QR-selected models miss some components of the true models, and the rate could be
even closer to the oracle rate otherwise. Third, we characterize conditions under which
$\ell_1$-QR contains the true model as a submodel, and derive bounds on the dimension of the
selected model, uniformly over $\mathcal{U}$; we also provide conditions under which
hard-thresholding selects the minimal true model, uniformly over $\mathcal{U}$. Finally, we
evaluate the performance of $\ell_1$-QR and post-$\ell_1$-QR in a numerical experiment, and
provide an application to testing the validity of the Solow-Swan model for international
economic growth.

1. Introduction. Quantile regression is an important statistical method for analyzing
the impact of regressors on the conditional distribution of a response variable (cf. [24], [21]). In 
particular, it captures the heterogeneity of the impact of regressors on the different parts of the 
distribution [8], exhibits robustness to outliers [20], has excellent computational properties 
[31], and has wide applicability [20]. The asymptotic theory for quantile regression is well 
developed under both a fixed number of regressors and an increasing number of regressors. 
The asymptotic theory under a fixed number of regressors is given in [21], [30], [16], [18], [13] 
and others. The asymptotic theory under an increasing number of regressors is given in [17] 
and [3, 5], covering the case where the number of regressors p is negligible relative to the 
sample size n (i.e., p = o(n)). 

In this paper, we consider quantile regression in high-dimensional sparse models (HDSMs). 
In such models, the overall number of regressors p is very large, possibly much larger than the 
sample size n. However, the number of significant regressors for each conditional quantile of 
interest is at most $s$ and smaller than the sample size, that is, $s = o(n)$. HDSMs ([7, 12, 29])
have emerged to deal with many new applications arising in biometrics, signal processing, 
machine learning, econometrics, and other areas of data analysis where high-dimensional data 
sets have become widely available. 

A number of papers have begun to investigate estimation of HDSMs, primarily focusing
on penalized mean regression, with the $\ell_1$-norm acting as a penalty function [7, 12, 23,
29, 36, 38]. [7, 12, 23, 29, 38] demonstrated the fundamental result that $\ell_1$-penalized
least squares estimators achieve the rate $\sqrt{s/n}\sqrt{\log p}$, which is very close to the
oracle rate $\sqrt{s/n}$ achievable when the true model is known. [36] demonstrated a similar
fundamental result on the excess forecasting error loss under both quadratic and
non-quadratic loss functions. Thus the estimator can be consistent and can have excellent
forecasting performance even under very rapid, nearly exponential growth of the total number
of regressors $p$. See [7, 9-11, 15, 27, 32] for many other interesting developments and a
detailed review of the existing literature.

Our paper's contribution is to develop a set of results on model selection and rates of
convergence for quantile regression within the HDSM framework. Since ordinary quantile
regression is not consistent in HDSMs, we consider quantile regression penalized by the
$\ell_1$-norm of parameter coefficients, denoted $\ell_1$-QR. First, we show that $\ell_1$-QR
estimates of regression coefficients and regression functions are consistent at the
near-oracle rate $\sqrt{s/n}\sqrt{\log(p \vee n)}$ in leading designs, uniformly in a compact
interval $\mathcal{U} \subset (0,1)$ of quantile indices.¹ (This result is different from, and
hence complementary to, [36]'s fundamental results on the rates for excess forecasting error
loss.) Second, in order to make $\ell_1$-QR practical, we propose a partly pivotal, data-driven
choice of the penalty level, and show that this choice leads to the same sharp convergence
rate. Third, we show that $\ell_1$-QR correctly selects the true model as a valid submodel when
the non-zero coefficients of the true model are well separated from zero. Fourth, we also
propose and analyze the post-penalized estimator (post-$\ell_1$-QR), which applies ordinary,
unpenalized quantile regression to the model selected by the penalized estimator, and thus
aims at reducing the regularization bias of the penalized estimator. We show that
post-$\ell_1$-QR can perform as well as $\ell_1$-QR in terms of the rate of convergence,
uniformly over $\mathcal{U}$, even if the $\ell_1$-QR-based model selection misses some
components of the true models. This occurs because $\ell_1$-QR-based model selection can miss
only those components that have relatively small coefficients. Moreover, post-$\ell_1$-QR can
perform better than $\ell_1$-QR if the $\ell_1$-QR-based model selection correctly includes
all components of the true model as a subset. (Obviously, post-$\ell_1$-QR can perform as well
as the oracle if $\ell_1$-QR perfectly selects the true model, which is, however, unrealistic
for many designs of interest.) Fifth, we illustrate the use of $\ell_1$-QR and post-$\ell_1$-QR
with a Monte Carlo experiment and an international economic growth example. To the best of our
knowledge, all of the above results are new and contribute to the literature on HDSMs. We also
hope that our results on post-penalized estimators and some proofs could be of interest in
other problems. We provide further technical comparisons to the literature in Section 2.

1.1. Notation. In what follows, we implicitly index all parameter values by the sample
size $n$, but we omit the index whenever this does not cause confusion. We use the empirical
process notation as defined in [37]. In particular, given a random sample $Z_1, \ldots, Z_n$, let
$\mathbb{G}_n(f) := n^{-1/2}\sum_{i=1}^n (f(Z_i) - E[f(Z_i)])$ and
$\mathbb{E}_n f := n^{-1}\sum_{i=1}^n f(Z_i)$. We use the notation $a \lesssim b$ to denote
$a \le cb$ for some constant $c > 0$ that does not depend on $n$; and $a \lesssim_P b$ to denote
$a = O_P(b)$. We also use the notation $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$.
We denote the $\ell_2$-norm by $\|\cdot\|$, the $\ell_1$-norm by $\|\cdot\|_1$, the
$\ell_\infty$-norm by $\|\cdot\|_\infty$, and the $\ell_0$-"norm" by $\|\cdot\|_0$ (i.e., the
number of non-zero components). We denote by
$\|\beta\|_{1,n} = \sum_{j=1}^p \hat\sigma_j|\beta_j|$ the $\ell_1$-norm weighted by the
$\hat\sigma_j$'s. Finally, given a vector $\delta \in \mathbb{R}^p$ and a set of indices
$T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector in which
$\delta_{Tj} = \delta_j$ if $j \in T$, and $\delta_{Tj} = 0$ if $j \notin T$.



¹Under $s \to \infty$, the oracle rate, uniformly over a proper compact interval $\mathcal{U}$,
is $\sqrt{(s/n)\log n}$, cf. [5]; the oracle rate for a single quantile index is $\sqrt{s/n}$,
cf. [17].

2. The Estimator, the Penalty Level, and Overview of Rate Results. In this
section we formulate the setting and the estimator, and state primitive regularity conditions.
We also provide an overview of the main results.

2.1. Basic Setting. The set-up of interest corresponds to a parametric quantile regression
model, where the dimension $p$ of the underlying model increases with the sample size $n$.
Namely, we consider a response variable $y$ and $p$-dimensional covariates $x$ such that the
$u$-th conditional quantile function of $y$ given $x$ is given by

(2.1)  $Q_{y|x}(u) = x'\beta(u), \quad \beta(u) \in \mathbb{R}^p, \quad \text{for all } u \in \mathcal{U},$

where $\mathcal{U} \subset (0, 1)$ is a compact set of quantile indices. We consider the case
where the dimension $p$ of the model is large, possibly much larger than the available sample
size $n$, but the true model $\beta(u)$ has a sparse support

$T_u = \mathrm{support}(\beta(u)) = \{j \in \{1, \ldots, p\} : |\beta_j(u)| > 0\},$

having only $\|\beta(u)\|_0 \le s \le n/\log(n \vee p)$ non-zero components for all
$u \in \mathcal{U}$.

The population coefficient $\beta(u)$ is known to be a minimizer of the criterion function

(2.2)  $Q_u(\beta) = E[\rho_u(y - x'\beta)],$

where $\rho_u(t) = (u - 1\{t \le 0\})t$ is the asymmetric absolute deviation function [21].
Given a random sample $(y_1, x_1), \ldots, (y_n, x_n)$, the quantile regression estimator of
$\beta(u)$ is defined as a minimizer of the empirical analog of (2.2):

(2.3)  $\hat Q_u(\beta) = \mathbb{E}_n[\rho_u(y_i - x_i'\beta)].$

In high-dimensional settings, particularly when $p \ge n$, ordinary quantile regression is
generally not consistent, which motivates the use of penalization in order to remove all, or at
least nearly all, regressors whose population coefficients are zero, thereby possibly restoring
consistency. A penalization that has proven to be quite useful in least squares settings
is the $\ell_1$-penalty leading to the Lasso estimator [34].

2.2. The Penalized and Post-Penalized Estimators. The $\ell_1$-penalized quantile regression
estimator $\hat\beta(u)$ is a solution to the following optimization problem:

(2.4)  $\min_{\beta \in \mathbb{R}^p} \hat Q_u(\beta) + \frac{\lambda\sqrt{u(1-u)}}{n}\sum_{j=1}^p \hat\sigma_j|\beta_j|,$

where $\hat\sigma_j^2 = \mathbb{E}_n[x_{ij}^2]$. The criterion function in (2.4) is the sum of the
criterion function (2.3) and a penalty function given by a scaled $\ell_1$-norm of the
parameter vector. The overall penalty level $\lambda\sqrt{u(1-u)}$ depends on each quantile
index $u$, while $\lambda$ will depend on the set $\mathcal{U}$ of quantile indices of interest.
The $\ell_1$-penalized quantile regression has been considered in [19] under small (fixed) $p$
asymptotics. It is important to note that the penalized quantile regression problem (2.4) is
equivalent to a linear programming problem (see Appendix C) with a dual version that is useful
for analyzing the sparsity of the solution. When the solution is not unique, we define
$\hat\beta(u)$ as any optimal basic feasible solution (see, e.g., [6]). Being a linear program,
the problem (2.4) can be solved in polynomial time, avoiding the computational curse of
dimensionality. Our goal is to derive the rate of convergence and model selection properties
of this estimator.
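
The linear programming form of (2.4) can be sketched as follows. This is only an illustrative sketch, not the formulation of Appendix C: the function names `check_loss` and `l1_qr` are hypothetical, the splitting of coefficients and residuals into positive and negative parts is a standard device that we assume here, and the HiGHS solver called through SciPy is not guaranteed to return a basic feasible solution.

```python
import numpy as np
from scipy.optimize import linprog


def check_loss(t, u):
    """Asymmetric absolute deviation rho_u(t) = (u - 1{t <= 0}) * t."""
    t = np.asarray(t, dtype=float)
    return (u - (t <= 0)) * t


def l1_qr(X, y, u, lam):
    """Solve the penalized problem (2.4) as a linear program (sketch).

    Variables: beta = bp - bm and residual y - X beta = rp - rm, all parts >= 0.
    Objective (criterion (2.4) multiplied by n):
        sum_i [u * rp_i + (1 - u) * rm_i]
        + lam * sqrt(u (1 - u)) * sum_j sigma_hat_j * (bp_j + bm_j).
    """
    n, p = X.shape
    sigma_hat = np.sqrt(np.mean(X**2, axis=0))       # sigma_hat_j^2 = E_n[x_ij^2]
    pen = lam * np.sqrt(u * (1 - u)) * sigma_hat
    cost = np.concatenate([pen, pen, u * np.ones(n), (1 - u) * np.ones(n)])
    # Equality constraint: X bp - X bm + rp - rm = y.
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    return z[:p] - z[p:2 * p]
```

Setting `lam=0.0` gives ordinary quantile regression, while a very large `lam` shrinks every coefficient to zero, since the intercept is penalized as well in this sketch.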

The post-penalized estimator (post-$\ell_1$-QR) applies ordinary quantile regression to the
model $\hat T_u$ selected by the $\ell_1$-penalized quantile regression. Specifically, set

$\hat T_u = \mathrm{support}(\hat\beta(u)) = \{j \in \{1, \ldots, p\} : |\hat\beta_j(u)| > 0\},$

and define the post-penalized estimator $\tilde\beta(u)$ as

(2.5)  $\tilde\beta(u) \in \arg\min_{\beta \in \mathbb{R}^p :\ \beta_{\hat T_u^c} = 0} \hat Q_u(\beta),$

that is, in (2.5) we remove the regressors that were not selected from further estimation. If
the model selection works perfectly, that is, $\hat T_u = T_u$, then this estimator is simply
the oracle estimator, whose properties are well known. However, perfect model selection might
be unlikely for many designs of interest. Specifically, we are interested in the highly
realistic scenario where the first-step estimator $\hat\beta(u)$ fails to select some of the
components of $\beta(u)$. Our goal is to derive the rate of convergence for the post-penalized
estimator and to show that it can actually perform well under this scenario.
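
The two-step construction in (2.5) can be illustrated with a short sketch. The function names and the numerical tolerance `tol` used to decide which first-step coefficients count as non-zero are assumptions made for illustration; the refit step is ordinary quantile regression written as a linear program.

```python
import numpy as np
from scipy.optimize import linprog


def quantile_regression(X, y, u):
    """Unpenalized quantile regression via its LP form (free coefficients)."""
    n, k = X.shape
    # Variables: beta (free), rp >= 0, rm >= 0 with y - X beta = rp - rm.
    cost = np.concatenate([np.zeros(k), u * np.ones(n), (1 - u) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:k]


def post_l1_qr(X, y, u, beta_hat, tol=1e-8):
    """Refit ordinary QR on the support of a first-step estimate beta_hat."""
    support = np.flatnonzero(np.abs(beta_hat) > tol)   # selected model T_hat_u
    beta_tilde = np.zeros(X.shape[1])
    if support.size > 0:
        beta_tilde[support] = quantile_regression(X[:, support], y, u)
    return beta_tilde, support
```

Coefficients outside the selected support are held at zero, which is exactly the constraint imposed in (2.5).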

2.3. The choice of the penalty level $\lambda$. In order to describe our choice of the penalty
level $\lambda$, we introduce the random variable

(2.6)  $\Lambda = n \sup_{u \in \mathcal{U}} \max_{1 \le j \le p} \left| \mathbb{E}_n\left[ \frac{x_{ij}(u - 1\{u_i \le u\})}{\hat\sigma_j\sqrt{u(1-u)}} \right] \right|,$

where $u_1, \ldots, u_n$ are i.i.d. uniform (0,1) random variables, independently distributed
from the regressors $x_1, \ldots, x_n$. The random variable $\Lambda$ has a known, that is,
pivotal, distribution conditional on $X = [x_1, \ldots, x_n]'$. Then we set

(2.7)  $\lambda = c \cdot \Lambda(1 - \alpha|X)$, where $\Lambda(1 - \alpha|X) := (1 - \alpha)$-quantile of $\Lambda$ conditional on $X$,

and the constant $c > 1$ depends on the design.² Thus the penalty level depends on the pivotal
quantity $\Lambda(1 - \alpha|X)$ and the design. Under assumptions D.1-D.4 we can set $c = 2$,
similarly to [7]'s choice for least squares. Furthermore, we recommend computing
$\Lambda(1 - \alpha|X)$ using simulation of $\Lambda$.³
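
The simulation of $\Lambda$ can be sketched as below. This is a minimal illustration under stated assumptions: the supremum over $\mathcal{U}$ is approximated by a maximum over a finite grid `u_grid`, and the function name, the number of draws `n_sim`, and the defaults `alpha=0.1` and `c=2.0` are illustrative choices rather than prescriptions.

```python
import numpy as np


def simulate_penalty_level(X, u_grid, alpha=0.1, c=2.0, n_sim=500, seed=0):
    """Monte Carlo approximation of lambda = c * Lambda(1 - alpha | X), cf. (2.6)-(2.7)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sigma_hat = np.sqrt(np.mean(X**2, axis=0))
    Xs = X / sigma_hat                          # columns divided by sigma_hat_j
    u_grid = np.asarray(u_grid, dtype=float)
    scale = np.sqrt(u_grid * (1 - u_grid))
    draws = np.empty(n_sim)
    for b in range(n_sim):
        unif = rng.uniform(size=n)              # u_1, ..., u_n, independent of X
        # score[i, k] = u_k - 1{u_i <= u_k}
        score = u_grid[None, :] - (unif[:, None] <= u_grid[None, :])
        # n * E_n[...] is a plain sum over i; stat[j, k] is its absolute value.
        stat = np.abs(Xs.T @ score) / scale[None, :]
        draws[b] = stat.max()                   # max over j and over the grid of u
    return c * np.quantile(draws, 1 - alpha), draws
```

Because the draws are generated conditionally on the observed $X$, the simulated quantile reflects the correlation among the columns of $X$ in the sample.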

The parameter $1 - \alpha$ is the confidence level in the sense that our (non-asymptotic)
bounds on the estimation error will hold with probability close to $1 - \alpha$. If we want to
maximize this confidence level, e.g., as in [7], subject to the estimation error contracting at
the optimal rate as $n \to \infty$, then we may set

(2.8)  $1 - \alpha = 1 - 1/p \to 1$ as $p \to \infty$.

This choice of $1 - \alpha$ does not affect the stochastic order of magnitude of
$\Lambda(1 - \alpha|X)$ and leads to optimal rates of convergence, as follows from Theorems 1
and 2, respectively. However, a high confidence level has the cost of a high regularization
bias. Therefore, if, instead, we want to minimize the regularization bias subject to the
estimation error contracting at the optimal rate, then we should set the confidence level
$1 - \alpha$ to grow as slowly as possible. As a limit of this rule, we may set

(2.9)  $1 - \alpha = 1 - \alpha_0$ for some fixed $1/p < \alpha_0 < 1$.

In this case, estimation error will contract at the optimal rate with a probability that is
close to $1 - \alpha_0$. Our non-asymptotic bounds on the estimation error stated in Theorem 2
expressly allow for any choice of $1 - \alpha$, including either (2.8) or (2.9). Also, in
computational experiments, we found that (2.9) with $1 - \alpha_0 = 0.9$ worked well.

The formal rationale behind the choice (2.7) for the penalty level $\lambda$ is that this choice
precisely leads to the optimal rates of convergence for $\ell_1$-QR. (The same or slightly
higher choice of $\lambda$ also guarantees good performance of post-$\ell_1$-QR.) Our general
strategy for choosing $\lambda$ follows [7], who recommend selecting $\lambda$ so that it
dominates a relevant measure of noise in the sample criterion function, specifically the
supremum norm of (a suitably rescaled) gradient of the sample criterion function evaluated at
the true parameter value. In our case this general strategy leads precisely to the choice
(2.7), because the gradient of the quantile regression objective function evaluated at the
truth has a pivotal representation, as we further explain below. This makes our choice of
$\lambda$ independent of the conditional density of $y_i$ given $x_i$, which is of considerable
practical value. In contrast, in the least squares problem, the choice of $\lambda$ depends on
the standard deviation of the regression errors, and also relies on the homoscedasticity and
Gaussianity of errors.

²The constant $c$ depends only on the constant $c_0$ appearing in condition D.4; when
$c_0 \ge 9$, it suffices to set $c = 2$.
³We also provide analytical bounds on $\Lambda(1 - \alpha|X)$ of the form
$C(\alpha, \mathcal{U})\sqrt{n \log p}$ for some numeric constant $C(\alpha, \mathcal{U})$. We
recommend simulation because it accounts for correlation among the columns of $X$ in the sample,
whereas the analytical bound effectively treats the columns of $X$ as uncorrelated and is thus
more conservative, at least in finite samples.

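A less formal, though more intuitive, rationale for the choice (2.7) is as follows. By
optimality, $\hat\beta(u)$ obeys

$0 \in \partial\hat Q_u(\hat\beta(u)) + (\lambda/n)\sqrt{u(1-u)}\,\partial\|\hat\beta(u)\|_{1,n}$ for all $u \in \mathcal{U}$,

where $\partial$ is the subdifferential operator. Let $\hat S_u$ denote an element of
$\partial\hat Q_u$. Since $\partial\|\beta\|_{1,n} \subset \{\hat D v : v \in [-1, 1]^p\}$, where
$\hat D = \mathrm{diag}[\hat\sigma_1, \ldots, \hat\sigma_p]$, for each $u$ there is an element
$\hat S_u(\hat\beta(u)) \in \partial\hat Q_u(\hat\beta(u))$ such that
$\sup_{u \in \mathcal{U}} n\|\hat D^{-1}\hat S_u(\hat\beta(u))/\sqrt{u(1-u)}\|_\infty \le \lambda$.
Then it makes sense to choose $\lambda$ so that the true value $\beta(u)$ also obeys this
constraint with a high probability:

(2.10)  $P\left\{ \Lambda = \sup_{u \in \mathcal{U}} n\left\| \hat D^{-1}\hat S_u(\beta(u))/\sqrt{u(1-u)} \right\|_\infty \le \lambda \right\} \ge 1 - \alpha,$

where $\hat S_u(\beta(u)) = \mathbb{E}_n[(u - 1\{y_i \le x_i'\beta(u)\})x_i] \in \partial\hat Q_u(\beta(u))$.
A key observation is that $\hat S_u(\beta(u)) = \mathbb{E}_n[(u - 1\{u_i \le u\})x_i]$ for
$u_1, \ldots, u_n$ i.i.d. uniform (0, 1) conditional on $X$, and so we can represent $\Lambda$ as
in (2.6), and, thus, choose $\lambda$ as in (2.7).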



2.4. Regularity Conditions. We consider the following conditions on a sequence of models
indexed by $n$ with parameter dimension $p = p_n \to \infty$. In these conditions all constants
can depend on $n$, but we omit the explicit indexing by $n$ to ease exposition.

D.1. Sampling and Smoothness. Data $(y_i, x_i')'$, $i = 1, \ldots, n$, are an i.i.d. sequence of
real $(1 + p)$-vectors, with the conditional $u$-quantile function given by (2.1), and with the
first component of $x_i$ equal to one. For each value $x$ in the support of $x_i$, the
conditional density $f_{y_i|x_i}(y|x)$ is continuously differentiable in $y$, and
$f_{y_i|x_i}(y|x)$ and $\frac{\partial}{\partial y} f_{y_i|x_i}(y|x)$ are bounded in absolute
value by constants $\bar f$ and $\bar f'$, uniformly in $y$ and $x$. We assume that
$n \wedge p \ge 3$.

Condition D.1 imposes only mild smoothness assumptions on the conditional density of the
response variable given regressors, and does not impose any normality or homoscedasticity
assumptions commonly made in the literature on HDSMs.

D.2. Sparsity and Smoothness of $u \mapsto \beta(u)$. Let $\mathcal{U}$ be a compact subset of
(0,1). The coefficients $\beta(u)$ in (2.1) are sparse and smooth with respect to
$u \in \mathcal{U}$:

$\sup_{u \in \mathcal{U}} \|\beta(u)\|_0 \le s$ and $\|\beta(u) - \beta(u')\| \le L|u - u'|$, for all $u, u' \in \mathcal{U}$,

where $s \ge 1$, and $\log L \le C_L \log(p \vee n)$ for some constant $C_L$.

Condition D.2 imposes sparsity and also smoothness on the behavior of the quantile
regression coefficients $\beta(u)$ as we vary the quantile index $u$.

D.3. Well-behaved Covariates. Covariates are normalized such that
$\sigma_j^2 = E[x_{ij}^2] = 1$ for all $j = 1, \ldots, p$, and
$\hat\sigma_j^2 = \mathbb{E}_n[x_{ij}^2]$ obeys
$P(\max_{1 \le j \le p} |\hat\sigma_j - 1| \le 1/2) \ge 1 - \gamma \to 1$ as $n \to \infty$.

Condition D.3 requires that $\hat\sigma_j$ does not deviate too much from $\sigma_j$ and
normalizes $\sigma_j^2 = E[x_{ij}^2] = 1$. That $1 - \gamma \to 1$ as $n \to \infty$ implicitly
restricts the growth of $p$ relative to $n$; in particular, for bounded or Gaussian regressors
this requires $n/\log p \to \infty$. The normalization $E[x_{ij}^2] = 1$ is not restrictive
since we do not exploit it in the construction of the estimator and since we can always rescale
the original parameters so that regressors have this property. Indeed, given the original
parameter $\beta^o(u)$ and regressors $x_i^o$, define the rescaled parameter
$\beta(u) = D\beta^o(u)$ and $x_i = D^{-1}x_i^o$ for
$D = \mathrm{diag}[\sigma_1^o, \ldots, \sigma_p^o]$, $(\sigma_j^o)^2 = E[(x_{ij}^o)^2]$. Then the
rates of convergence in the original parametrization follow from the rates of convergence in
the rescaled parametrization from the relation
$\|G(\hat\beta^o(u) - \beta^o(u))\|^2 \le \mathrm{maxeig}(D^{-1}G'GD^{-1})\,\|\hat\beta(u) - \beta(u)\|^2$
for any matrix $G$.
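
The rescaling argument can be illustrated numerically. In this sketch we replace the population moments $E[(x_{ij}^o)^2]$ by their sample analogs, which is an assumption made for illustration only; the function names are hypothetical.

```python
import numpy as np


def rescale_design(X_raw):
    """Scale columns so that the sample second moment E_n[x_ij^2] equals 1."""
    d = np.sqrt(np.mean(X_raw**2, axis=0))    # sample analog of sigma_j^o
    return X_raw / d, d                       # x_i = D^{-1} x_i^o


def to_original_scale(beta_rescaled, d):
    """Invert beta = D beta^o, that is, beta^o = D^{-1} beta."""
    return beta_rescaled / d
```

By construction the fitted index is unchanged, since $x_i'\beta(u) = (x_i^o)'\beta^o(u)$.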

In order to state the next assumption, for some $c_0 > 0$ and each $u \in \mathcal{U}$, define

$A_u := \{\delta \in \mathbb{R}^p : \|\delta_{T_u^c}\|_1 \le c_0\|\delta_{T_u}\|_1,\ \|\delta_{T_u^c}\|_0 \le n\},$

which will be referred to as the restricted set. Define
$\bar T_u(\delta, m) \subset \{1, \ldots, p\} \setminus T_u$ as the support of the $m$ largest in
absolute value components of the vector $\delta$ outside of $T_u = \mathrm{support}(\beta(u))$,
where $\bar T_u(\delta, m)$ is the empty set if $m = 0$.

D.4. Restricted Identifiability and Nonlinearity. For some constants $m \ge 0$ and
$c_0 \ge 9$, the matrices $E[x_ix_i']$ and
$J_u = E[f_{y_i|x_i}(x_i'\beta(u)|x_i)\,x_ix_i']$, $u \in \mathcal{U}$, satisfy

(RE($c_0$, $m$))  $\kappa_m^2 := \inf_{u \in \mathcal{U}} \inf_{\delta \in A_u, \delta \ne 0} \frac{\delta'E[x_ix_i']\delta}{\|\delta_{T_u \cup \bar T_u(\delta, m)}\|^2} > 0, \qquad \underline{f} := \inf_{u \in \mathcal{U}} \inf_{\delta \in A_u, \delta \ne 0} \frac{\delta'J_u\delta}{\delta'E[x_ix_i']\delta} > 0,$

and $\log(\underline{f}\kappa_0^2) \le C_f \log(n \vee p)$ for some constant $C_f$. Moreover,

(RNI($c_0$))  $q := \frac{3}{8}\frac{\underline{f}^{3/2}}{\bar f'} \inf_{u \in \mathcal{U}} \inf_{\delta \in A_u, \delta \ne 0} \frac{E[|x_i'\delta|^2]^{3/2}}{E[|x_i'\delta|^3]} > 0.$

Comment 2.1 (RE condition). The restricted eigenvalue (RE) condition is a quantile
analog of [7]'s condition for means. The RE constants $\kappa_m$ and $\underline{f}$ determine
the rate of convergence and can change with $n$, although in many designs, such as Example 1
given below, these constants will be bounded away from zero and from above. [7] prove that the
RE condition on the design matrix $E[x_ix_i']$ is quite general, and is weaker than nearly all
other design conditions used in the least squares literature; also, since $\kappa_m$ is
non-increasing in $m$, RE($c_0$, $m$) for any $m > 0$ implies RE($c_0$, 0). The constant
$\underline{f}$ controls the modulus of continuity between norms weighted by the design matrix
$E[x_ix_i']$ and the Jacobian matrices $J_u$. Note that $\underline{f}$ is bounded below by
$\underline{f}^o$, the minimal value of the conditional density of $y_i$ evaluated at the
conditional quantile $x_i'\beta(u)$:

(2.11)  $\underline{f}^o := \inf_{u \in \mathcal{U},\, x} f_{y_i|x_i}(x'\beta(u)|x),$

where $\underline{f} = \underline{f}^o$ holds with equality in location models, such as
Example 1, but $\underline{f} > \underline{f}^o$ in general. The overall rationale behind the RE
condition is that under our choice of penalty level, $\delta = \hat\beta(u) - \beta(u)$ will
belong to the restricted set $A_u$ with a high probability, and so identifiability and rates
would follow from the behavior of the quadratic form $\delta'J_u\delta$ relative to the norm of
$\delta$ over $A_u$, as characterized by $\underline{f}$ and $\kappa_m$. Lastly, the additional
condition $\log(\underline{f}\kappa_0^2) \lesssim \log(n \vee p)$ requires that
$\underline{f}\kappa_0^2$ does not increase faster than some power of $(n \vee p)$. This
assumption is mild, since typically we are more concerned with $\underline{f}^{1/2}\kappa_0$
going to zero, and it simplifies the statements of the main results.

Comment 2.2 (RNI condition). The restricted non-linear impact (RNI) coefficient $q$
appearing in D.4 is a new concept, which controls the quality of minoration of the quantile
regression objective function by a quadratic function over the restricted set, in the sense
precisely described by Lemma 2. It turns out that this coefficient is well-behaved for several
designs of interest. Indeed, if the covariates $x$ have a log-concave density, then

$q \ge 3\underline{f}^{3/2}/(8K_\ell\bar f')$ for a universal constant $K_\ell$.

On the other hand, if the covariates have $|x_{ij}|$ uniformly bounded by $K_B$ for each
$j \le p$, and the RE($c_0$, 0) condition holds, then
$q \ge 3\underline{f}^{3/2}\kappa_0/(8\bar f'K_B(1 + c_0)\sqrt{s})$. Indeed, the former bound
follows from $E[|x_i'\delta|^3] \le K_\ell E[|x_i'\delta|^2]^{3/2}$ holding for log-concave $x$
for some universal constant $K_\ell$, by Theorem 5.22 of [28]. The latter bound follows from
$E[|x_i'\delta|^3] \le E[|x_i'\delta|^2]K_B\|\delta\|_1 \le E[|x_i'\delta|^2]K_B(1 + c_0)\sqrt{s}\,\|\delta_{T_u}\| \le E[|x_i'\delta|^2]^{3/2}K_B(1 + c_0)\sqrt{s}/\kappa_0$,
holding since $\delta \in A_u$ so that
$\|\delta\|_1 \le (1 + c_0)\|\delta_{T_u}\|_1 \le \sqrt{s}(1 + c_0)\|\delta_{T_u}\|$.

Finally, we state another condition that is needed to derive results on the post-model
selected estimator. In order to state the condition, define the sparse set
$\tilde A_u(\tilde m) = \{\delta \in \mathbb{R}^p : \|\delta_{T_u^c}\|_0 \le \tilde m\}$ for
$\tilde m \ge 0$ and $u \in \mathcal{U}$.

D.5. Sparse Identifiability and Nonlinearity. The matrices $E[x_ix_i']$ and $J_u$,
$u \in \mathcal{U}$, satisfy:

(SE($\tilde m$))  $\tilde\kappa_{\tilde m}^2 := \inf_{u \in \mathcal{U}} \inf_{\delta \in \tilde A_u(\tilde m), \delta \ne 0} \frac{\delta'E[x_ix_i']\delta}{\delta'\delta} > 0$ and $\tilde f_{\tilde m} := \inf_{u \in \mathcal{U}} \inf_{\delta \in \tilde A_u(\tilde m), \delta \ne 0} \frac{\delta'J_u\delta}{\delta'E[x_ix_i']\delta} > 0,$

for some $\tilde m \ge 0$. Moreover,

(SNI($\tilde m$))  $\tilde q_{\tilde m} := \frac{3}{8}\frac{\tilde f_{\tilde m}^{3/2}}{\bar f'} \inf_{u \in \mathcal{U}} \inf_{\delta \in \tilde A_u(\tilde m), \delta \ne 0} \frac{E[|x_i'\delta|^2]^{3/2}}{E[|x_i'\delta|^3]} > 0.$

Comment 2.3 (SE condition). We invoke the sparse eigenvalue (SE) condition in order to
analyze the post-penalized estimator. This assumption is similar to the conditions used in [29]
and [38] to analyze Lasso. Our form of the SE condition is neither less nor more general than
the RE condition. The rationale behind this condition is that the post-penalized estimator
$\tilde\beta(u)$ will be sparse, of dimension at most $s + \hat m$, where $\hat m$ is the number
of unnecessary components, that is, components outside $T_u$. Therefore, both identifiability
and rates of convergence would follow from the behavior of the quadratic form
$\delta'J_u\delta$ relative to $\|\delta\|^2$ over such sparse vectors, as characterized by
$\tilde f_{\tilde m}$ and $\tilde\kappa_{\tilde m}$ with $\tilde m = \hat m$.

Comment 2.4 (SNI condition). The SNI coefficient $\tilde q_{\tilde m}$ controls the quality of
minoration of the quantile regression objective function by a quadratic function over sparse
neighborhoods of the true parameter. Similarly to the RNI coefficient, if the covariates $x$
have a log-concave density, then the SNI coefficient satisfies

$\tilde q_{\tilde m} \ge (3/8)\tilde f_{\tilde m}^{3/2}/(K_\ell\bar f'),$

and if the covariates have $|x_{ij}|$ uniformly bounded by $K_B$ and the SE condition holds,
then $\tilde q_{\tilde m} \ge (3/8)\tilde f_{\tilde m}^{3/2}\tilde\kappa_{\tilde m}/(\bar f'K_B\sqrt{\tilde m + s})$.
Note that if the selected model has no unnecessary components ($\hat m = 0$), condition D.5 is
an assumption only on the true support.

Example 1 (Location Model with Correlated Normal Design). Let us consider estimating
a standard location model

$y = x'\beta^o + \varepsilon,$

where $\varepsilon \sim N(0, \sigma^2)$, $\sigma > 0$ is fixed, $x_1 = 1$ and

$x_{-1} \sim N(0, \Sigma)$, $\Sigma_{ij} = \rho^{|i-j|}$ for a fixed $-1 < \rho < 1$.

This model implies a linear quantile model with coefficients
$\beta_1(u) = \beta_1^o + \sigma\Phi^{-1}(u)$ and $\beta_j(u) = \beta_j^o$ for
$j = 2, \ldots, p$, in which conditions D.1-5 are easily met for any compact set of quantile
indices $\mathcal{U} \subset (0, 1)$. Indeed, let

$\bar f' = \sup_z \phi'(z/\sigma)/\sigma^2, \quad \bar f = \sup_z \phi(z/\sigma)/\sigma, \quad \underline{f}^o = \min_{u \in \mathcal{U}} \phi(\Phi^{-1}(u))/\sigma,$

so that D.1 holds with the constants $\bar f$ and $\bar f'$. D.2 holds, since
$\|\beta(u)\|_0 \le s \le \|\beta^o\|_0 + 1$ and $u \mapsto \beta(u)$ is Lipschitz over
$\mathcal{U}$ with the constant $L = \sup_{u \in \mathcal{U}} \sigma/\phi(\Phi^{-1}(u))$, which
trivially obeys $\log L \lesssim \log(n \vee p)$. D.3 also holds, in particular by Chernoff's
tail bound

$P\left\{\max_{1 \le j \le p} |\hat\sigma_j - 1| \le 1/2\right\} \ge 1 - \gamma = 1 - 2p\exp(-n/24),$

where $1 - \gamma$ approaches 1 if $n/\log p \to \infty$. Furthermore, the smallest eigenvalue
of the population design matrix $\Sigma$ is at least $(1 - |\rho|)/(2 + 2|\rho|)$ and the maximum
eigenvalue is at most $(1 + |\rho|)/(1 - |\rho|)$. Thus, D.4 and D.5 hold with

$\kappa_m \wedge \tilde\kappa_{\tilde m} \ge \sqrt{(1 - |\rho|)/(2 + 2|\rho|)}, \quad \underline{f} \wedge \tilde f_{\tilde m} \ge \underline{f}^o, \quad q \wedge \tilde q_{\tilde m} \ge (3/8)\left((\underline{f}^o)^{3/2}/(\bar f'K_\ell)\right),$

for all $m, \tilde m \ge 0$ and for a universal constant $K_\ell > 0$ defined in Comment 2.2.
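
A small simulation sketch of Example 1 may help fix ideas. The function names, the default values `rho=0.5` and `sigma=1.0`, and the seed are illustrative assumptions; only the model itself (intercept plus Gaussian regressors with covariance $\rho^{|i-j|}$ and Gaussian noise) comes from the example.

```python
import numpy as np
from scipy.stats import norm


def simulate_location_model(n, beta0, rho=0.5, sigma=1.0, seed=0):
    """Draw (X, y) from y = x'beta0 + eps with x_1 = 1 and x_{-1} ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    p = len(beta0)
    idx = np.arange(p - 1)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = rho^|i-j|
    Z = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
    X = np.column_stack([np.ones(n), Z])
    y = X @ np.asarray(beta0, dtype=float) + sigma * rng.normal(size=n)
    return X, y


def true_quantile_coefficients(beta0, u, sigma=1.0):
    """beta_1(u) = beta0_1 + sigma * Phi^{-1}(u); other coefficients do not vary with u."""
    beta_u = np.array(beta0, dtype=float)
    beta_u[0] += sigma * norm.ppf(u)
    return beta_u
```

Only the intercept varies with the quantile index, so the sparsity of $\beta(u)$ exceeds that of $\beta^o$ by at most one.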



2.5. Overview of Main Results. Here we discuss our results under the simple setup of
Example 1 and under $1/p \le \alpha \to 0$ and $\gamma \to 0$. These simple assumptions allow us
to straightforwardly compare our rate results to those obtained in the literature. We state our
more general non-asymptotic results under general conditions in the subsequent sections. Our
first main rate result is that $\ell_1$-QR, with our choice (2.7) of parameter $\lambda$,
satisfies

(2.12)  $\sup_{u \in \mathcal{U}} \|\hat\beta(u) - \beta(u)\| \lesssim_P \frac{1}{\underline{f}\kappa_0\kappa_s}\sqrt{\frac{s\log(n \vee p)}{n}},$

provided that the upper bound on the number of non-zero components $s$ satisfies

(2.13)  $\frac{\sqrt{s\log(n \vee p)}}{\sqrt{n}\,\underline{f}^{1/2}\kappa_0\,q} \to 0.$

Note that $\kappa_0$, $\kappa_s$, $\underline{f}$, and $q$ are bounded away from zero in this
example. Therefore, the rate of convergence is $\sqrt{s/n}\cdot\sqrt{\log(n \vee p)}$ uniformly
in the set of quantile indices $u \in \mathcal{U}$, which is very close to the oracle rate when
$p$ grows polynomially in $n$. Further, we note that our resulting restriction (2.13) on the
dimension $s$ of the true models is very weak; when $p$ is polynomial in $n$, $s$ can be of
almost the same order as $n$, namely $s = o(n/\log n)$.
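
To see how close the near-oracle rate is to the oracle rate, the two can be compared numerically. The sketch below ignores all constants and uses hypothetical function names; it only evaluates the orders $\sqrt{s/n}\sqrt{\log(n \vee p)}$ and $\sqrt{s/n}$.

```python
import math


def near_oracle_rate(s, n, p):
    """Order of the l1-QR rate: sqrt(s/n) * sqrt(log(n v p)), constants ignored."""
    return math.sqrt(s / n) * math.sqrt(math.log(max(n, p)))


def oracle_rate(s, n):
    """Order of the oracle rate for a single quantile index: sqrt(s/n)."""
    return math.sqrt(s / n)


def rate_inflation(s, n, p):
    """Ratio of the two orders, which reduces to sqrt(log(n v p))."""
    return near_oracle_rate(s, n, p) / oracle_rate(s, n)
```

For instance, with $n = 1000$ and $p = n^2$ the ratio is $\sqrt{\log(10^6)}$, roughly 3.7, regardless of $s$.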

Our second main result is that the dimension $\|\hat\beta(u)\|_0$ of the model selected by the
$\ell_1$-penalized estimator is of the same stochastic order as the dimension $s$ of the true
models, namely

(2.14)  $\sup_{u \in \mathcal{U}} \|\hat\beta(u)\|_0 \lesssim_P s.$

Further, if the parameter values of the minimal true model are well separated from zero, then
with a high probability the model selected by the $\ell_1$-penalized estimator correctly nests
the true minimal model:

(2.15)  $T_u = \mathrm{support}(\beta(u)) \subseteq \hat T_u = \mathrm{support}(\hat\beta(u))$, for all $u \in \mathcal{U}$.

Moreover, we provide conditions under which a hard-thresholded version of the estimator
selects the correct support.
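
A hard-thresholded version of the estimator can be sketched as follows. The threshold level is left as a user-supplied argument here; its appropriate magnitude is not specified in this overview, so the parameter `threshold` and the function name are assumptions for illustration.

```python
import numpy as np


def hard_threshold(beta_hat, threshold):
    """Set to zero every coefficient whose absolute value is at most `threshold`."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    kept = np.abs(beta_hat) > threshold
    return np.where(kept, beta_hat, 0.0), np.flatnonzero(kept)
```

The second return value is the support of the thresholded estimate, which can then be compared with the true support in simulations.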

Our third main result is that the post-penalized estimator, which applies ordinary quantile
regression to the selected model, obeys

(2.16)  $\sup_{u \in \mathcal{U}} \|\tilde\beta(u) - \beta(u)\| \lesssim_P \frac{1}{\tilde f_{\hat m}\tilde\kappa_{\hat m}^2}\sqrt{\frac{\hat m\log(n \vee p) + s\log n}{n}} + \sup_{u \in \mathcal{U}} 1\{T_u \not\subseteq \hat T_u\}\sqrt{\frac{s\log(n \vee p)}{n}},$

where $\hat m = \sup_{u \in \mathcal{U}} \|\hat\beta_{T_u^c}(u)\|_0$ is the maximum number of
wrong components selected for any quantile index $u \in \mathcal{U}$, provided that the bound
on the number of non-zero components $s$ obeys the growth condition (2.13) and

(2.17)  $\frac{\sqrt{\hat m\log(n \vee p) + s\log n}}{\sqrt{n}\,\tilde f_{\hat m}^{1/2}\tilde\kappa_{\hat m}\,\tilde q_{\hat m}} \to_P 0.$

We see from (2.16) that post-$\ell_1$-QR can perform well in terms of the rate of convergence
even if the selected model $\hat T_u$ fails to contain the true model $T_u$. Indeed, since in
this design $\hat m \lesssim_P s$, post-$\ell_1$-QR has the rate of convergence
$\sqrt{s/n}\cdot\sqrt{\log(n \vee p)}$, which is the same as the rate of convergence of
$\ell_1$-QR. The intuition for this result is that the $\ell_1$-QR-based model selection can
only miss covariates with relatively small coefficients, which then permits post-$\ell_1$-QR to
perform as well or even better due to reductions in bias, as confirmed by our computational
experiments.

We also see from (2.16) that post-$\ell_1$-QR can perform better than $\ell_1$-QR in terms of
the rate of convergence if the number of wrong components selected obeys $\hat m = o_P(s)$ and
the selected model contains the true model, $\{T_u \subseteq \hat T_u\}$, with probability
converging to one. In this case post-$\ell_1$-QR has the rate of convergence
$\sqrt{(o_P(s)/n)\log(n \vee p) + (s/n)\log n}$, which is faster than the rate of convergence of
$\ell_1$-QR. In the extreme case of perfect model selection, that is, when $\hat m = 0$, the
rate of post-$\ell_1$-QR becomes $\sqrt{(s/n)\log n}$ uniformly in $\mathcal{U}$. (When
$\mathcal{U}$ is a singleton, the $\log n$ factor drops out.) Note that the inclusion
$\{T_u \subseteq \hat T_u\}$ necessarily happens when the coefficients of the true models are
well separated from zero, as we stated above. Note also that the condition $\hat m = o(s)$ or
even $\hat m = 0$ could occur under additional conditions on the regressors (such as the mutual
coherence conditions that restrict the maximal pairwise correlation of regressors). Finally, we
note that our second restriction (2.17) on the dimension $s$ of the true models is very weak in
this design; when $p$ is polynomial in $n$, $s$ can be of almost the same order as $n$, namely
$s = o(n/\log n)$.

To the best of our knowledge, all of the results presented above are new, both for the
single $\ell_1$-penalized quantile regression problem as well as for the infinite collection of
$\ell_1$-penalized quantile regression problems. These results therefore contribute to the rate
results obtained for $\ell_1$-penalized mean regression and related estimators in the
fundamental papers of [7, 12, 23, 29, 36, 38]. To the best of our knowledge, our results on
post-penalized estimators have no analogs in the literature on mean regression, apart from the
rather exceptional case of perfect model selection, in which case the post-penalized estimator
is simply the oracle.⁴ Our results on sparsity of $\ell_1$-QR and model selection also
contribute to the analogous results for mean regression [29]. Also, our rate results for
$\ell_1$-QR are different from, and hence complementary to, the fundamental results in [36] on
the excess forecasting loss under possibly non-quadratic loss functions; [36] also specializes
those results to density estimation, mean regression, and logistic regression. Indeed, in
principle we could apply theorems in [36] to the single quantile regression problem to derive
the bounds on the excess loss from forecasting $y_i$ with $x_i'\beta(u)$ under loss
$\rho_u$.⁵ However, these bounds would not imply our results (2.12), (2.16), (2.14), (2.15), and
(2.7), which characterize the rates of estimating coefficients $\beta(u)$ by $\ell_1$-QR and
post-$\ell_1$-QR, sparsity and model selection properties, and the data-driven choice of the
penalty level.

3. Main Results and Main Proofs. In this section we derive rates of convergence
for $\ell_1$-QR and post-$\ell_1$-QR, sparsity bounds, and model selection results.

3.1. Bounds on $\Lambda(1 - \alpha|X)$. We start with a characterization of $\Lambda$ and its
$(1 - \alpha)$-quantile, $\Lambda(1 - \alpha|X)$, which determine the magnitude of our suggested
penalty level $\lambda$ via equation (2.7).

⁴In a companion work, we extend our results to least squares and related problems.

⁵Of course, such a derivation would entail some difficult work, since we must verify some
high-level assumptions made directly on the performance of the oracle and penalized estimators
in population (cf. [36]'s conditions I.1 and I.2 and others), and which do not apply in our main
examples, e.g., in Example 1 with normal regressors.

Theorem 1 (Bounds on $\Lambda(1 - \alpha|X)$). Let
$W_{\mathcal{U}} = \max_{u \in \mathcal{U}} 1/\sqrt{u(1-u)}$. We have that there is a universal
constant $C_\Lambda$ such that

(i)  $P\left(\Lambda \ge k \cdot C_\Lambda W_{\mathcal{U}}\sqrt{n\log p}\ \middle|\ X\right) \le p^{-k^2+1},$

(ii)  $\Lambda(1 - \alpha|X) \le \sqrt{1 + \log(1/\alpha)/\log p}\cdot C_\Lambda W_{\mathcal{U}}\sqrt{n\log p}$ with probability 1.

3.2. Rates of Convergence. In this section we establish the rate of convergence of $\ell_1$-QR.
We start with the following preliminary result, which shows that if the penalty level exceeds
the specified threshold, the estimation error $\hat\beta(u) - \beta(u)$ will belong to the restricted set $A_u$.

Lemma 1 (Restricted Set). 1. Under D.3, with probability at least $1-\gamma$ we have for every
$\delta \in \mathbb{R}^p$ that

(3.1) $\tfrac{2}{3}\|\delta\|_{1,n} \le \|\delta\|_1 \le 2\|\delta\|_{1,n}$.

2. Moreover, if for some $\alpha \in (0,1)$

(3.2) $\lambda \ge \lambda_0 := \frac{c_0+3}{c_0-3}\,\Lambda(1-\alpha|X)$,

then with probability at least $1-\alpha-\gamma$, uniformly in $u \in \mathcal U$, we have (3.1) and
$\hat\beta(u) - \beta(u) \in A_u := \{\delta \in \mathbb{R}^p : \|\delta_{T_u^c}\|_1 \le c_0\|\delta_{T_u}\|_1,\ \|\delta_{T_u^c}\|_0 \le n\}$.

This result is inspired by the analogous result of [7] for least squares. 
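To make the restricted set of Lemma 1 concrete, the following Python sketch checks whether a vector $\delta$ satisfies the two defining constraints, $\|\delta_{T_u^c}\|_1 \le c_0\|\delta_{T_u}\|_1$ and $\|\delta_{T_u^c}\|_0 \le n$. The function name and argument layout are illustrative choices, not part of the paper.

```python
def in_restricted_set(delta, support, c0, n):
    """Return True if delta lies in the restricted set A_u of Lemma 1.

    delta   : list of p coefficient deviations
    support : set of indices T_u (support of the true coefficient vector)
    c0      : cone constant
    n       : sample size (cap on the number of non-zeros off the support)
    """
    on_l1 = sum(abs(d) for j, d in enumerate(delta) if j in support)
    off = [d for j, d in enumerate(delta) if j not in support]
    off_l1 = sum(abs(d) for d in off)
    off_l0 = sum(1 for d in off if d != 0)
    return off_l1 <= c0 * on_l1 and off_l0 <= n
```

The first constraint is a cone condition: scaling $\delta$ by a positive constant does not change membership, which is the property exploited in the proof of Theorem 2.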

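Lemma 2 (Identifiability Relations over Restricted Set). Condition D.4, namely RE($c_0$, $m$)
and RNI($c_0$), implies that for any $\delta \in A_u$ and $u \in \mathcal U$,

(3.3) $\|(E[x_i x_i'])^{1/2}\delta\| \le \|J_u^{1/2}\delta\| / \underline{f}^{1/2}$,

(3.4) $\|\delta_{T_u}\|_1 \le \sqrt{s}\,\|J_u^{1/2}\delta\| / [\underline{f}^{1/2}\kappa_0]$,

(3.5) $\|\delta\|_1 \le \sqrt{s}\,(1+c_0)\,\|J_u^{1/2}\delta\| / [\underline{f}^{1/2}\kappa_0]$,

(3.6) $\|\delta\| \le \big(1 + c_0\sqrt{s/m}\big)\,\|J_u^{1/2}\delta\| / [\underline{f}^{1/2}\kappa_m]$,

(3.7) $Q_u(\beta(u)+\delta) - Q_u(\beta(u)) \ge \big(\|J_u^{1/2}\delta\|^2/4\big) \wedge \big(q\,\|J_u^{1/2}\delta\|\big)$.

This second preliminary result derives identifiability relations over $A_u$. It shows that the
coefficients $\underline{f}$, $\kappa_0$, and $\kappa_m$ control moduli of continuity between various norms over the
restricted set $A_u$, and the RNI coefficient $q$ controls the quality of minoration of the objective
function by a quadratic function over $A_u$.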

Finally, the third preliminary result derives bounds on the empirical error over $A_u$:

Lemma 3 (Control of Empirical Error). Under D.1-4, for any $t > 0$ let

$e(t) := \sup_{u\in\mathcal U,\ \delta\in A_u,\ \|J_u^{1/2}\delta\|\le t} \big|\widehat Q_u(\beta(u)+\delta) - Q_u(\beta(u)+\delta) - \big(\widehat Q_u(\beta(u)) - Q_u(\beta(u))\big)\big|$.

Then, there is a universal constant $C_E$ such that for any $A > 1$, with probability at least
$1 - 3\gamma - 3p^{-A^2}$,

$e(t) \le t \cdot C_E \cdot \frac{(1+c_0)A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p \vee [L\underline{f}^{1/2}\kappa_0/t])}{n}}$.

In order to prove the lemma we use a combination of chaining arguments and exponential
inequalities for contractions [25]. Our use of the contraction principle is inspired by its
fundamentally innovative use in [36]; however, the use of the contraction principle alone is not
sufficient in our case. Indeed, first we need to make some adjustments to obtain error bounds
over the neighborhoods defined by the intrinsic norm $\|J_u^{1/2}\cdot\|$ instead of the $\|\cdot\|_1$ norm; and
second, we need to use chaining over $u \in \mathcal U$ to get uniformity over $\mathcal U$.

Armed with Lemmas 1-3, we establish the first main result. The result depends on the
constants $C_\Lambda$, $C_E$, $C_L$, and $C_f$ defined in Theorem 1, Lemma 3, D.2, and D.4.

Theorem 2 (Uniform Bounds on Estimation Error of $\ell_1$-QR). Assume that conditions
D.1-4 hold, and let $C > 2C_\Lambda\sqrt{1 + \log(1/\alpha)/\log p} \vee \big[C_E\sqrt{1 \vee [C_L + C_f + 1/2]}\big]$. Let $\lambda_0$ be defined
as in (3.2). Then uniformly in the penalty level $\lambda$ such that

(3.8) $\lambda_0 \le \lambda \le C \cdot W_{\mathcal U}\sqrt{n\log p}$,

we have that, for any $A > 1$, with probability at least $1 - \alpha - 4\gamma - 3p^{-A^2}$,

$\sup_{u\in\mathcal U}\|J_u^{1/2}(\hat\beta(u) - \beta(u))\| \le 8C \cdot \frac{(1+c_0)W_{\mathcal U}A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$,

provided $s$ obeys the growth condition

(3.9) $2C \cdot (1+c_0)W_{\mathcal U}A \cdot \sqrt{s\log(p\vee n)} < q\,\underline{f}^{1/2}\kappa_0\sqrt{n}$.

This result derives the rate of convergence of the $\ell_1$-penalized quantile regression estimator
in the intrinsic norm uniformly in $u \in \mathcal U$ as well as uniformly in the penalty level $\lambda$ in the range
specified by (3.8), which includes our recommended choice of $\lambda_0$. An immediate consequence
of this result and of Lemma 2 is the following corollary.

Corollary 1. Under the conditions of Theorem 2, for any $A > 1$, with probability at
least $1 - \alpha - 4\gamma - 3p^{-A^2}$,

$\sup_{u\in\mathcal U}\big(E[x'(\hat\beta(u) - \beta(u))]^2\big)^{1/2} \le 8C\,\frac{(1+c_0)W_{\mathcal U}A}{\underline{f}\,\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$,

$\sup_{u\in\mathcal U}\|\hat\beta(u) - \beta(u)\| \le 8C\,\frac{1 + c_0\sqrt{s/m}}{\kappa_m}\cdot\frac{(1+c_0)W_{\mathcal U}A}{\underline{f}\,\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$.

We see that the rates of convergence for $\ell_1$-QR generally depend on the number of
significant regressors $s$, the logarithm of the number of regressors $p$, the strength of identification
summarized by $\kappa_0$, $\kappa_m$, $\underline{f}$, and $q$, and the quantile indices of interest $\mathcal U$ (as expected, extreme
quantiles can slow down the rates of convergence). These rate results parallel the results of
[7] obtained for $\ell_1$-penalized mean regression. Indeed, the role of the parameter $\underline{f}$ is similar to
the role of the standard deviation of the disturbance in mean regression. It is worth noting,
however, that our results do not rely on normality and homoscedasticity assumptions, and
our proofs have to address the non-quadratic nature of the objective function, with parameter
$q$ controlling the quality of quadratization. This parameter $q$ enters the results only through
the growth restriction (3.9) on $s$. At this point we refer the reader to Section 2.4 for a further
discussion of this result in the context of the correlated normal design. Finally, we note that
our proof combines the star-shaped geometry of the restricted set $A_u$ with classical convexity
arguments; this insight may be of interest in other problems.

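Proof of Theorem 2. We let

$t := 8C \cdot \frac{(1+c_0)W_{\mathcal U}A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$

and consider the following events:

(i) $\Omega_1$ := the event that (3.1) and $\hat\beta(u) - \beta(u) \in A_u$, uniformly in $u \in \mathcal U$, hold;
(ii) $\Omega_2$ := the event that the bound on empirical error $e(t)$ in Lemma 3 holds;
(iii) $\Omega_3$ := the event in which $\Lambda(1-\alpha|X) \le \sqrt{1 + \log(1/\alpha)/\log p}\cdot C_\Lambda W_{\mathcal U}\sqrt{n\log p}$.

By the choice of $\lambda$ and Lemma 1, $P(\Omega_1) \ge 1 - \alpha - \gamma$; by Lemma 3, $P(\Omega_2) \ge 1 - 3\gamma - 3p^{-A^2}$;
and by Theorem 1, $P(\Omega_3) = 1$; hence $P(\cap_{k=1}^3\Omega_k) \ge 1 - \alpha - 4\gamma - 3p^{-A^2}$.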

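Given the event $\cap_{k=1}^3\Omega_k$, we want to show the event that

(3.10) $\exists u \in \mathcal U,\ \|J_u^{1/2}(\hat\beta(u) - \beta(u))\| > t$

is impossible, which will prove the theorem. First note that the event in (3.10) implies that
for some $u \in \mathcal U$

$0 > \min_{\delta\in A_u,\ \|J_u^{1/2}\delta\|\ge t} \widehat Q_u(\beta(u)+\delta) - \widehat Q_u(\beta(u)) + \frac{\lambda\sqrt{u(1-u)}}{n}\big(\|\beta(u)+\delta\|_{1,n} - \|\beta(u)\|_{1,n}\big)$.

The key observation is that by convexity of $\widehat Q_u(\cdot) + \|\cdot\|_{1,n}\,\lambda\sqrt{u(1-u)}/n$ and by the fact that
$A_u$ is a cone, we can replace $\|J_u^{1/2}\delta\| \ge t$ by $\|J_u^{1/2}\delta\| = t$ in the above inequality and still
preserve it:

$0 > \min_{\delta\in A_u,\ \|J_u^{1/2}\delta\| = t} \widehat Q_u(\beta(u)+\delta) - \widehat Q_u(\beta(u)) + \frac{\lambda\sqrt{u(1-u)}}{n}\big(\|\beta(u)+\delta\|_{1,n} - \|\beta(u)\|_{1,n}\big)$.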

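Also, by inequality (3.4) in Lemma 2, for each $\delta \in A_u$

$\|\beta(u)\|_{1,n} - \|\beta(u)+\delta\|_{1,n} \le \|\delta_{T_u}\|_{1,n} \le 2\|\delta_{T_u}\|_1 \le 2\sqrt{s}\,\|J_u^{1/2}\delta\| / [\underline{f}^{1/2}\kappa_0]$,

which, together with $\sqrt{u(1-u)} \le 1/2$, then further implies

(3.11) $0 > \min_{\delta\in A_u,\ \|J_u^{1/2}\delta\| = t} \widehat Q_u(\beta(u)+\delta) - \widehat Q_u(\beta(u)) - \frac{\lambda\sqrt{s}}{n\,\underline{f}^{1/2}\kappa_0}\|J_u^{1/2}\delta\|$.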

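Also by Lemma 3, under our choice of $t \ge 1/[\underline{f}^{1/2}\kappa_0\sqrt{n}]$, $\log(L\underline{f}\kappa_0^2) \le (C_L + C_f)\log(n\vee p)$,
and under event $\Omega_2$

(3.12) $e(t) \le t\,C_E\sqrt{1 \vee [C_L + C_f + 1/2]}\;\frac{(1+c_0)A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$.

Therefore, we obtain from (3.11) and (3.12)

$0 > \min_{\delta\in A_u,\ \|J_u^{1/2}\delta\| = t} Q_u(\beta(u)+\delta) - Q_u(\beta(u)) - \frac{\lambda\sqrt{s}}{n\,\underline{f}^{1/2}\kappa_0}\|J_u^{1/2}\delta\| - t\,C_E\sqrt{1 \vee [C_L + C_f + 1/2]}\;\frac{(1+c_0)A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$.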

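Using the identifiability relation (3.7) stated in Lemma 2, we further get

$0 > \frac{t^2}{4} \wedge (qt) - t\,\frac{\lambda\sqrt{s}}{n\,\underline{f}^{1/2}\kappa_0} - t\,C_E\sqrt{1 \vee [C_L + C_f + 1/2]}\;\frac{(1+c_0)A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$.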

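Using the upper bound on $\lambda$ under event $\Omega_3$, we obtain

$0 > \frac{t^2}{4} \wedge (qt) - t\,\frac{C\,W_{\mathcal U}}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log p}{n}} - t\,C_E\sqrt{1 \vee [C_L + C_f + 1/2]}\;\frac{(1+c_0)A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}}$.

Note that $qt$ cannot be smaller than $t^2/4$ under the growth condition (3.9) of the theorem.
Thus, using also the lower bound on $C$ given in the theorem and $c_0 > 1$, we obtain the
relation

$0 > \frac{t^2}{4} - t \cdot 2C\,\frac{(1+c_0)W_{\mathcal U}A}{\underline{f}^{1/2}\kappa_0}\sqrt{\frac{s\log(p\vee n)}{n}} = 0$,

which is impossible. Therefore, the result follows. □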
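The closing contradiction can be checked by direct substitution. The short verification below uses only the definition of $t$ from the start of the proof; the symbol $R$ is shorthand introduced here for the common rate factor and is not notation from the paper.

```latex
R := \frac{(1+c_0)\,W_{\mathcal U}\,A}{\underline{f}^{1/2}\kappa_0}
     \sqrt{\frac{s\log(p\vee n)}{n}},
\qquad t = 8CR
\;\Longrightarrow\;
\frac{t^2}{4} = t\cdot\frac{8CR}{4} = t\cdot 2CR,
\qquad\text{so}\qquad
\frac{t^2}{4} - t\cdot 2CR = 0 .
```

Hence a strict inequality $0 > t^2/4 - t\cdot 2CR$ cannot hold, which is exactly why the constant 8 appears in the definition of $t$. The same substitution shows that the growth condition (3.9) is equivalent to $t \le 4q$, that is, to $t^2/4 \le qt$.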

3.3. Sparsity Properties. Next we derive sparsity properties of the solution to $\ell_1$-penalized
quantile regression. Fundamentally, sparsity is linked to the first order optimality conditions
of (2.4) and therefore to the (sub)gradient of the criterion function. In the case of least squares,
the gradient is a smooth (linear) function of the parameters. In the case of quantile regression,
the gradient is a highly non-smooth (piece-wise constant) function. To control the sparsity of
$\hat\beta(u)$ we rely on empirical process arguments to approximate gradients by smooth functions.
In particular, we crucially exploit the fact that the entropy of all $m$-dimensional submodels
of the $p$-dimensional model is of order $m\log p$, which depends on $p$ only logarithmically.

The statement of the results will depend on the maximal $k$-sparse eigenvalue of $E[x_i x_i']$
and $E_n[x_i x_i']$, specifically on

(3.13) $\varphi(k) = \max_{\|\delta\|=1,\ \|\delta\|_0\le k} E[(x_i'\delta)^2]$ and $\phi(k) = \sup_{\|\delta\|\le 1,\ \|\delta\|_0\le k} E_n[(x_i'\delta)^2] \vee E[(x_i'\delta)^2]$.

In order to establish our main sparsity result, we need two preliminary lemmas.

Lemma 4 (Empirical Pre-Sparsity). Let $\hat s = \sup_{u\in\mathcal U}\|\hat\beta(u)\|_0$. Under D.1-4, for any $\lambda > 0$,
with probability at least $1-\gamma$ we have

$\hat s \le n \wedge p \wedge [4n^2\phi(\hat s)W_{\mathcal U}^2/\lambda^2]$.

In particular, if $\lambda \ge 2\sqrt{2}\,W_{\mathcal U}\sqrt{n\log(n\vee p)\,\phi(n/\log(n\vee p))}$ then $\hat s \le n/\log(n\vee p)$.

This lemma establishes an initial bound on the number of non-zero components $\hat s$ as a
function of $\lambda$ and $\phi(\hat s)$. Restricting $\lambda \ge 2\sqrt{2}\,W_{\mathcal U}\sqrt{n\log(n\vee p)\,\phi(n/\log(n\vee p))}$ makes the
term $\phi(n/\log(n\vee p))$ appear in subsequent bounds instead of the term $\phi(n)$, which in turn
weakens some assumptions. Indeed, not only is the first term smaller than the second, but
also there are designs of interest where the second term diverges while the first does not; for
instance in Example 1, we have $\phi(n/\log(n\vee p)) \lesssim_P 1$ while $\phi(n) \gtrsim_P \sqrt{\log p}$ by [4].

The following lemma establishes a bound on the sparsity as a function of the rate of 
convergence. 



Lemma 5 (Empirical Sparsity). Assume D.1-4 and let $r = \sup_{u\in\mathcal U}\|J_u^{1/2}(\hat\beta(u) - \beta(u))\|$.
Then, for any $\varepsilon > 0$, there is a constant $K_\varepsilon \ge \sqrt{2}$ such that with probability at least $1 - \varepsilon - \gamma$



Finally, we combine these results to establish the main sparsity result. In what follows, we
define $\phi_\varepsilon$ as a constant such that $\phi(n/\log(n\vee p)) \le \phi_\varepsilon$ with probability $1-\varepsilon$.

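Theorem 3 (Uniform Sparsity Bounds). Let $\varepsilon > 0$ be any constant, assume D.1-4 hold,
and let $\lambda$ satisfy $\lambda \ge \lambda_0$ and

$K W_{\mathcal U}\sqrt{n\log(n\vee p)} \le \lambda \le K' W_{\mathcal U}\sqrt{n\log(n\vee p)}$

for some constant $K' \ge K \ge 2K_\varepsilon\phi_\varepsilon^{1/2}$, for $K_\varepsilon$ defined in Lemma 5. Then, for any $A > 1$, with
probability at least $1 - \alpha - 2\varepsilon - 4\gamma - p^{-A^2}$,

$\hat s := \sup_{u\in\mathcal U}\|\hat\beta(u)\|_0 \le s \cdot \big[16\mu W_{\mathcal U}/(\underline{f}^{1/2}\kappa_0)\big]^2\,\big[(1+c_0)AK'/K\big]^2$,

where $\mu := \mu(n/\log(n\vee p))$, provided that $s$ obeys the growth condition

(3.14) $2K'(1+c_0)AW_{\mathcal U}\sqrt{s\log(n\vee p)} < q\,\underline{f}^{1/2}\kappa_0\sqrt{n}$.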



The theorem states that by setting the penalty level $\lambda$ to be possibly higher than our
initial recommended choice $\lambda_0$, we can control $\hat s$, which will be crucial for good performance
of the post-penalized estimator. As a corollary, we note that if (a) $\mu \lesssim 1$, (b) $1/(\underline{f}^{1/2}\kappa_0) \lesssim 1$,
and (c) $\phi_\varepsilon \lesssim 1$ for each $\varepsilon > 0$, then $\hat s \lesssim s$ with a high probability, so the dimension of the
selected model is about the same as the dimension of the true model. Conditions (a), (b),
and (c) easily hold for the correlated normal design in Example 1. In particular, (c) follows
from the concentration inequalities and from results in classical random matrix theory; see
[4] for proofs. Therefore the possibly higher $\lambda$ needed to achieve the stated sparsity bound
does not slow down the rate of $\ell_1$-QR in this case. The growth condition (3.14) on $s$ is also
weak in this case.

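Proof of Theorem 3. By the choice of $K$ and Lemma 4, $\hat s \le n/\log(n\vee p)$ with
probability $1-\varepsilon$. With at least the same probability, the choice of $\lambda$ yields

$\frac{K_\varepsilon\sqrt{n\log(n\vee p)\,\phi(\hat s)}}{\lambda} \le \frac{K_\varepsilon\phi_\varepsilon^{1/2}}{K W_{\mathcal U}} \le \frac{1}{2W_{\mathcal U}}$,

so that by virtue of Lemma 5 and by $\mu(\hat s) \le \mu := \mu(n/\log(n\vee p))$,

$\frac{\sqrt{\hat s}}{W_{\mathcal U}} \le \mu\,\frac{(r\wedge 1)n}{\lambda} + \frac{\sqrt{\hat s}}{2W_{\mathcal U}}$, or $\frac{\sqrt{\hat s}}{W_{\mathcal U}} \le 2\mu\,\frac{(r\wedge 1)n}{\lambda}$,

with probability $1 - 2\varepsilon$. Since all conditions of Theorem 2 hold, we obtain the result by
plugging in the upper bound on $r = \sup_{u\in\mathcal U}\|J_u^{1/2}(\hat\beta(u) - \beta(u))\|$ from Theorem 2. □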

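3.4. Model Selection Properties. Next we turn to the model selection properties of $\ell_1$-QR.

Theorem 4 (Model Selection Properties of $\ell_1$-QR). Let $r^o = \sup_{u\in\mathcal U}\|\hat\beta(u) - \beta(u)\|$. If
$\inf_{u\in\mathcal U}\min_{j\in T_u}|\beta_j(u)| > r^o$, then

(3.15) $T_u := \mathrm{support}(\beta(u)) \subseteq \hat T_u := \mathrm{support}(\hat\beta(u))$ for all $u \in \mathcal U$.

Moreover, the hard-thresholded estimator $\bar\beta(u)$, defined for any $\gamma \ge 0$ by

(3.16) $\bar\beta_j(u) = \hat\beta_j(u)\,1\{|\hat\beta_j(u)| > \gamma\}$, $u \in \mathcal U$, $j = 1,\dots,p$,

provided that $\gamma$ is chosen such that $r^o < \gamma < \inf_{u\in\mathcal U}\min_{j\in T_u}|\beta_j(u)| - r^o$, satisfies

$\mathrm{support}(\bar\beta(u)) = T_u$ for all $u \in \mathcal U$.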

These results parallel analogous results in [29] for mean regression. The first result says
that if non-zero coefficients are well separated from zero, then the support of $\ell_1$-QR includes
the support of the true model. The inclusion of the true support in (3.15) is in general
one-sided; the support of the estimator can include some unnecessary components having
true coefficients equal to zero. The second result states that if the further stated conditions
are satisfied, the additional hard thresholding can eliminate inclusions of such unnecessary
components. The value of the hard threshold must explicitly depend on the unknown value
$\min_{j\in T_u}|\beta_j(u)|$, characterizing the separation of non-zero coefficients from zero. The
additional conditions stated in this theorem are strong, and perfect model selection appears quite
unlikely in practice. Certainly it does not work in all real empirical examples we have explored.
This motivates our analysis of the post-model-selected estimator under the conditions that
allow for imperfect model selection, including cases where we miss some non-zero components
or have additional unnecessary components.
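The hard-thresholding rule (3.16) is simple to state in code. The following Python sketch applies it to a single coefficient vector and reads off the selected support; the function names and the numerical values in the usage note are illustrative only.

```python
def hard_threshold(beta_hat, gamma):
    """Apply (3.16): keep beta_hat[j] if |beta_hat[j]| > gamma, else set it to zero."""
    return [b if abs(b) > gamma else 0.0 for b in beta_hat]


def support(beta):
    """Indices of the non-zero components of beta."""
    return {j for j, b in enumerate(beta) if b != 0}
```

For example, with estimates `[1.02, 0.97, 0.04, -0.03, 0.0]` and threshold `0.1`, the two small spurious coefficients are zeroed and the support shrinks to the first two indices. In practice the admissible range for the threshold depends on the unknown minimal signal strength, which is the limitation discussed above.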

3.5. The post-penalized estimator. In this section we establish a bound on the rate of
convergence of the post-penalized estimator. The proof will rely crucially on the identifiability
and control of the empirical error over the sparse sets $\tilde A_u(\tilde m) := \{\delta \in \mathbb{R}^p : \|\delta_{T_u^c}\|_0 \le \tilde m\}$.

Lemma 6 (Sparse Identifiability and Control of Empirical Error). 1. Suppose D.1 and
D.5 hold. Then for all $\delta \in \tilde A_u(\tilde m)$, $u \in \mathcal U$, and $\tilde m \le n$, we have that

(3.17) $Q_u(\beta(u)+\delta) - Q_u(\beta(u)) \ge \big(\|J_u^{1/2}\delta\|^2/4\big) \wedge \big(\tilde q_{\tilde m}\|J_u^{1/2}\delta\|\big)$.

2. Suppose D.1-2 and D.5 hold and that $|\cup_{u\in\mathcal U} T_u| \le n$. Then for any $\varepsilon > 0$, there is a
constant $C_\varepsilon$ such that with probability at least $1-\varepsilon$ the empirical error

$\epsilon_u(\delta) := \widehat Q_u(\beta(u)+\delta) - Q_u(\beta(u)+\delta) - \big(\widehat Q_u(\beta(u)) - Q_u(\beta(u))\big)$

obeys

$\sup_{u\in\mathcal U,\ \delta\in\tilde A_u(\tilde m),\ \delta\ne 0} \frac{|\epsilon_u(\delta)|}{\|\delta\|} \le C_\varepsilon\sqrt{\frac{(\tilde m\log(n\vee p) + s\log n)\,\phi(\tilde m + s)}{n}}$ for all $\tilde m \le n$.

In order to prove this lemma we crucially exploit the fact that the entropy of all
$m$-dimensional submodels of the $p$-dimensional model is of order $m\log p$, which depends on
$p$ only logarithmically. The following theorem establishes the properties of the
post-model-selection estimators.

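Theorem 5 (Uniform Bounds on Estimation Error of post-$\ell_1$-QR). Let $\hat\beta(u)$ be any first
step estimator, $B_n$ a random variable such that $B_n \ge \sup_{u\in\mathcal U}\widehat Q_u(\hat\beta(u)) - \widehat Q_u(\beta(u))$, and $\tilde\beta(u)$
the second step estimator defined as (2.5) for each $u \in \mathcal U$. Assume that $|\cup_{u\in\mathcal U} T_u| \le n$, D.1-3
hold, and D.5 holds with $\tilde m := \sup_{u\in\mathcal U}\|\hat\beta_{T_u^c}(u)\|_0$ with probability $1-\varepsilon$. Then for any $\varepsilon > 0$
there is a constant $C_\varepsilon$ such that the bound

(3.18) $\sup_{u\in\mathcal U}\|J_u^{1/2}(\tilde\beta(u) - \beta(u))\| \le 4A_{\varepsilon,n}/[\underline{f}^{1/2}\tilde\kappa_{\tilde m}] + \sup_{u\in\mathcal U} 1\{T_u \not\subseteq \hat T_u\}\sqrt{4B_n}$

holds with probability at least $1 - 2\varepsilon$, where $A_{\varepsilon,n} := C_\varepsilon\sqrt{(\tilde m\log(n\vee p) + s\log n)\,\phi(\tilde m + s)/n}$,
provided that $s$ obeys the growth condition $4\tilde q_{\tilde m}A_{\varepsilon,n}/[\underline{f}^{1/2}\tilde\kappa_{\tilde m}] + \sup_{u\in\mathcal U} 1\{T_u \not\subseteq \hat T_u\}B_n < 4\tilde q_{\tilde m}^2$.
In particular, when the first step estimator $\hat\beta(u)$ is the $\ell_1$-QR estimator defined by (2.4) and
D.1-5 hold, the random variable $B_n$ can be defined as

(3.19) $B_n = \lambda\sqrt{s}\,r/(n\,\underline{f}^{1/2}\kappa_0)$, $r := \sup_{u\in\mathcal U}\|J_u^{1/2}(\hat\beta(u) - \beta(u))\|$.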

This theorem describes the performance of a general post-model selection estimator as well
as the performance of the post-$\ell_1$-QR estimator that results from using $\ell_1$-QR as the model
selector. After plugging in the bound on the rate $r$ and the choice of $\lambda$ from Theorem 2 into
(3.19), we obtain the following more explicit statement about the performance of post-$\ell_1$-QR,
as measured by the intrinsic norm.

Corollary 2. Under the conditions of Theorems 2 and 5, when the first step estimator
$\hat\beta(u)$ is the $\ell_1$-QR estimator defined by (2.4), with probability at least $1 - 3\gamma - 3p^{-A^2} - 2\varepsilon$



jI'\P{u) - (3{u)) 



^ 1/1^(771 + s) /mlog(n Vp) + slogn^ 



71/2- 

Lm '^m 



n 



um f ' no V n 



We can use the following corollary to assess performance in other norms of interest. 



Corollary 3. Under conditions of Theorem 5, with probability $1 - \varepsilon$



sup 



sup 



/?(u) - (3{u) 



< sup 



From Corollaries 2 and 3 we can conclude that in many interesting cases the rates of
post-$\ell_1$-QR could be the same or faster than the rate of $\ell_1$-QR. Indeed, first consider the case where
the model selection fails to contain the true model, i.e., $\sup_{u\in\mathcal U} 1\{T_u \not\subseteq \hat T_u\} = 1$ with a
non-negligible probability. If (a) $\tilde m \le \hat s \lesssim_P s$, (b) $\phi(\tilde m + s) \lesssim_P 1$, and (c) the constants and
are of the same order as / and KQi^m, respectively, then the rate of convergence of post-£i-QR 
is the same as the rate of convergence of $\ell_1$-QR. Recall that Theorem 3 provides sufficient
conditions needed to achieve (a), which hold in Example 1. Recall also that in Example
1 (b) holds by concentration of measure and classical results in random matrix theory, as
shown in [4], and (c) holds by the calculations presented in Section 2. This verifies our claim
regarding the performance of post-$\ell_1$-QR in the overview, Section 2.4. The intuition for this
result is that even though $\ell_1$-QR misses true components, it does not miss very important
ones, allowing post-$\ell_1$-QR still to perform well. Second, consider the case where the model
selection succeeds in containing the true model, i.e., $\sup_{u\in\mathcal U} 1\{T_u \not\subseteq \hat T_u\} = 0$ with probability
approaching one, and that the number of unnecessary components obeys $\tilde m = O_P(s)$. In this
case the rate of convergence of post-$\ell_1$-QR can be faster than the rate of convergence of
$\ell_1$-QR. In the extreme case of perfect model selection, when $\tilde m = 0$ with a high probability,
post-$\ell_1$-QR becomes the oracle estimator with a high probability. We refer the reader to
Section 2 for further discussion of this result, and note that this result could be of interest in
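other problems.

Proof of Theorem 5. To show the first claim, let

$\hat\delta(u) := \hat\beta(u) - \beta(u)$, $\tilde\delta(u) := \tilde\beta(u) - \beta(u)$, and $t_u := \|J_u^{1/2}\tilde\delta(u)\|$.

For every $u \in \mathcal U$, by optimality of $\tilde\beta(u)$ in (2.5),

(3.20) $\widehat Q_u(\tilde\beta(u)) - \widehat Q_u(\beta(u)) \le 1\{T_u \not\subseteq \hat T_u\}\big(\widehat Q_u(\hat\beta(u)) - \widehat Q_u(\beta(u))\big) \le 1\{T_u \not\subseteq \hat T_u\}B_n$.

Also, by Lemma 6, with probability at least $1-\varepsilon$, we have

(3.21) $\sup_{u\in\mathcal U}\frac{|\epsilon_u(\tilde\delta(u))|}{\|\tilde\delta(u)\|} \le C_\varepsilon\sqrt{\frac{(\tilde m\log(n\vee p) + s\log n)\,\phi(\tilde m + s)}{n}} = A_{\varepsilon,n}$.

Recall that $\sup_{u\in\mathcal U}\|\tilde\delta_{T_u^c}(u)\|_0 \le \tilde m \le n$ so that by D.5 $t_u \ge \underline{f}^{1/2}\tilde\kappa_{\tilde m}\|\tilde\delta(u)\|$ for all $u \in \mathcal U$ with
probability $1-\varepsilon$. Thus, combining relations (3.20) and (3.21), for every $u \in \mathcal U$

$Q_u(\tilde\beta(u)) - Q_u(\beta(u)) \le t_u A_{\varepsilon,n}/[\underline{f}^{1/2}\tilde\kappa_{\tilde m}] + 1\{T_u \not\subseteq \hat T_u\}B_n$

with probability at least $1 - 2\varepsilon$. Invoking the sparse identifiability relation (3.17) of Lemma
6, with the same probability, for all $u \in \mathcal U$

$(t_u^2/4) \wedge (\tilde q_{\tilde m}t_u) \le t_u A_{\varepsilon,n}/[\underline{f}^{1/2}\tilde\kappa_{\tilde m}] + 1\{T_u \not\subseteq \hat T_u\}B_n$.

We then conclude that under the assumed growth condition on $s$, this inequality implies

$t_u^2 \le 4t_u A_{\varepsilon,n}/[\underline{f}^{1/2}\tilde\kappa_{\tilde m}] + 4\cdot 1\{T_u \not\subseteq \hat T_u\}B_n$

for every $u \in \mathcal U$, and the bounds stated in the theorem now follow.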

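To show the second claim we note that by the optimality of $\hat\beta(u)$ in (2.4), with probability
$1-\gamma$ we have uniformly in $u \in \mathcal U$

(3.22) $\widehat Q_u(\hat\beta(u)) - \widehat Q_u(\beta(u)) \le \frac{\lambda\sqrt{u(1-u)}}{n}\big(\|\beta(u)\|_{1,n} - \|\hat\beta(u)\|_{1,n}\big) \le \frac{\lambda\sqrt{u(1-u)}}{n}\|\hat\delta_{T_u}(u)\|_{1,n} \le \frac{\lambda\sqrt{u(1-u)}}{n}\,2\|\hat\delta_{T_u}(u)\|_1$,

where the last term in (3.22) is bounded by

(3.23) $\frac{\lambda\sqrt{u(1-u)}}{n}\cdot\frac{2\sqrt{s}\,\|J_u^{1/2}\hat\delta(u)\|}{\underline{f}^{1/2}\kappa_0} \le \frac{\lambda\sqrt{u(1-u)}}{n}\cdot\frac{2\sqrt{s}\,r}{\underline{f}^{1/2}\kappa_0} \le B_n$,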

using that $\|J_u^{1/2}\hat\delta(u)\| \ge \underline{f}^{1/2}\kappa_0\|\hat\delta_{T_u}(u)\|$ from RE($c_0$, 0) implied by D.4. □

4. Empirical Performance. In order to assess the finite-sample practical performance
of the proposed estimators, we conducted a Monte Carlo study and an application to
international economic growth.

4.1. Monte Carlo Simulation. In this section we will compare the performance of the
canonical, $\ell_1$-penalized, post-$\ell_1$-penalized, and the ideal oracle quantile regression estimators.
Recall that the post-penalized estimator applies the canonical quantile regression to the
model selected by the penalized estimator. The oracle estimator applies the canonical quantile
regression on the true model. (Of course, such an estimator is not available outside Monte
Carlo experiments.) We focus our attention on the model selection properties of the penalized
estimator and biases and standard deviations of these estimators.

We begin by considering the following regression model, as in Example 1, where

$y = x'\beta(0.5) + \varepsilon$, $\beta(0.5) = (1, 1, 1, 1, 1, 0, \dots, 0)'$,

where $x$ consists of an intercept and covariates $x_{-1} \sim N(0, \Sigma)$, and the errors $\varepsilon$ are
independently and identically distributed $\varepsilon \sim N(0, 1)$. We set the dimension $p$ of covariates $x$
equal to 1000, the dimension $s$ of the true model to 5, and the sample size $n$ to 200. We
set the regularization parameter $\lambda$ equal to the 0.9-quantile of the pivotal random variable
$\Lambda$, following our proposal in Section 2. We consider two variants of the model above with
uncorrelated and correlated regressors, namely $\rho = 0$ and $\rho = 0.5$. We summarize the results
on model selection performance of the penalized estimator in Figures 1-2. In the left panels of
Figures 1-2, we plot the frequencies of the dimensions of the selected model; in the right panels
we plot the frequencies of selecting the correct components. From the right panels we see that
the model selection performance is particularly good. From the left panels we see that the
frequency of selecting a much larger model than the true model is very small. We also see
that in the design with correlated regressors, the performance of the estimator is quite good,
as we would expect from our theoretical results. These results confirm the theoretical results
of Theorem 4, namely, that when the non-zero coefficients are well-separated from zero, with
probability tending to one, the penalized estimator should select the model that includes the
true model as a subset. Moreover, these results also confirm the theoretical result of Theorem
3, namely that the dimension of the selected model should be of the same stochastic order
as the dimension of the true model. In summary, the model selection performance of the
penalized estimator agrees very well with our theoretical results.
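The design described above can be sketched in a few lines of Python. The sketch assumes that the covariance in Example 1 has the Toeplitz form $\Sigma_{jk} = \rho^{|j-k|}$ (Example 1 is not restated in this section, so this is an assumption), which is generated here through an AR(1) recursion across columns. The function name and defaults are illustrative, and only the standard library is used.

```python
import random


def simulate_design(n=200, p=1000, s=5, rho=0.5, seed=0):
    """Draw (X, y, beta) from the median-regression design of Section 4.1.

    Column 0 of X is an intercept; the remaining p - 1 columns are Gaussian
    with assumed covariance Sigma[j][k] = rho ** |j - k| (requires p >= 2).
    beta has s leading ones and zeros elsewhere; errors are N(0, 1).
    """
    rng = random.Random(seed)
    beta = [1.0] * s + [0.0] * (p - s)
    innov = (1.0 - rho ** 2) ** 0.5  # keeps each covariate at unit variance
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        row = [1.0, z]
        for _ in range(p - 2):
            # AR(1) step: corr(z_j, z_k) = rho ** |j - k|
            z = rho * z + innov * rng.gauss(0.0, 1.0)
            row.append(z)
        noise = rng.gauss(0.0, 1.0)
        y.append(sum(b * x for b, x in zip(beta, row)) + noise)
        X.append(row)
    return X, y, beta
```

Setting `rho=0` gives the isotropic design and `rho=0.5` the correlated one. Fitting the penalized and post-penalized estimators to such draws would additionally need a linear programming or quantile regression solver, which is outside this sketch.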

We summarize results on the estimation performance in Table 1. We see that the penalized
quantile regression estimator significantly outperforms the canonical quantile regression, as
we would expect from Theorem 2 and from inconsistency of the latter when the number of
regressors is larger than the sample size. The penalized quantile regression has a substantial
bias, as we would expect from the definition of the estimator, which penalizes large deviations
of coefficients from zero. Furthermore, we see that the post-penalized quantile regression
drastically improves upon the penalized quantile regression, particularly in terms of reducing
the bias. The post-penalized estimator in fact does almost as well as the ideal
oracle estimator. We also see that the (unarbitrary) correlation of regressors does not harm
the performance of the penalized and the two-step estimators, which we would expect from
our theoretical results. In summary, we find the estimation performance of the penalized and
two-step estimators to be in agreement with our theoretical results.



MONTE CARLO RESULTS

Isotropic Gaussian Design

                     Mean l0 norm   Mean l1 norm   Bias     Std Deviation
Canonical QR         1000           25.27          1.6929   0.99
Penalized QR         5.14           2.43           1.1519   0.37
Post-Penalized QR    5.14           4.97           0.0276   0.29
Oracle QR            5.00           5.00           0.0012   0.20

Correlated Gaussian Design

                     Mean l0 norm   Mean l1 norm   Bias     Std Deviation
Canonical QR         1000           29.40          1.2526   1.11
Penalized QR         5.19           4.09           0.4316   0.29
Post-Penalized QR    5.19           5.02           0.0075   0.27
Oracle QR            5.00           5.00           0.0013   0.25

Table 1

The table displays the average $\ell_0$ and $\ell_1$ norm of the estimators as well as mean bias and standard deviation.
We obtained the results using 5000 Monte Carlo repetitions for each design.



4.2. International Economic Growth Example. In this section we apply $\ell_1$-penalized
quantile regression to an international economic growth example, using it primarily as a method
for model selection. We use the Barro and Lee data consisting of a panel of 138 countries for
the period of 1960 to 1985. We consider the national growth rates in gross domestic product
(GDP) per capita as a dependent variable $y$ for the periods 1965-75 and 1975-85.^ In our
analysis, we will consider a model with $p = 60$ covariates, which allows for a total of $n = 90$
complete observations. Our goal here is to select a subset of these covariates and briefly

^The growth rate in GDP over a period from $t_1$ to $t_2$ is commonly defined as $\log(GDP_{t_2}/GDP_{t_1}) - 1$.

[Figure 1. Left panel: histogram of the number of non-zero components in $\hat\beta(1/2)$. Right panel: histogram of the number of correct components selected ($s = 5$, $\rho = 0$).]

Fig 1. The figure summarizes the covariate selection results for the isotropic normal design example, based
on 5000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates selected out
of the possible 1000 covariates. The right panel plots the histogram for the number of significant covariates
selected; there are in total 5 significant covariates amongst 1000 covariates.



[Figure 2. Left panel: histogram of the number of non-zero components in $\hat\beta(1/2)$. Right panel: histogram of the number of correct components selected ($s = 5$, $\rho = 0.5$).]

Fig 2. The figure summarizes the covariate selection results for the correlated normal design example with
correlation coefficient $\rho = 0.5$, based on 5000 Monte Carlo repetitions. The left panel plots the histogram for the
number of covariates selected out of the possible 1000 covariates. The right panel plots the histogram for the
number of significant covariates selected; there are in total 5 significant covariates amongst 1000 covariates.

compare the resulting models to the standard models used in the empirical growth literature
(Barro and Sala-i-Martin [1], Koenker and Machado [22]).

One of the central issues in the empirical growth literature is the estimation of the effect 
of an initial (lagged) level of GDP per capita on the growth rates of GDP per capita. In 
particular, a key prediction from the classical Solow-Swan-Ramsey growth model is the hy- 
pothesis of convergence, which states that poorer countries should typically grow faster and 
therefore should tend to catch up with the richer countries. Thus, such a hypothesis states 
that the effect of the initial level of GDP on the growth rate should be negative. As pointed 
out in Barro and Sala-i-Martin [2], this hypothesis is rejected using a simple bivariate re- 
gression of growth rates on the initial level of GDP. (In our case, median regression yields a 
positive coefficient of 0.00045.) In order to reconcile the data and the theory, the literature 
has focused on estimating the effect conditional on the pertinent characteristics of countries. 
Covariates that describe such characteristics can include variables measuring education and 
science policies, strength of market institutions, trade openness, savings rates and others [2]. 
The theory then predicts that for countries with otherwise similar characteristics the effect of the
initial level of GDP on the growth rate should be negative ([2]).

Given that the number of covariates we can condition on is comparable to the sample 
size, covariate selection becomes an important issue in this analysis ([26], [33]). In particular, 
previous findings came under severe criticism for relying on ad hoc procedures for covariate 
selection. In fact, in some cases, all of the previous findings have been questioned ([26]). 
Since the number of covariates is high, there is no simple way to resolve the model selection 
problem using only classical tools. Indeed the number of possible lower-dimensional models 
is very large, although [26] and [33] attempt to search over several millions of these models. 
Here we use the Lasso selection device, specifically $\ell_1$-penalized median regression, to resolve
this important issue.

Let us now turn to our empirical results. We performed covariate selection using
$\ell_1$-penalized median regression, where we initially used our data-driven choice of penalization
parameter $\lambda$. This initial choice led us to select no covariates, which is consistent with the
situations in which the true coefficients are not well-separated from zero. We then proceeded
to slowly decrease the penalization parameter in order to allow for some covariates to be
selected. We present the model selection results in Table 3. With the first relaxation of the
choice of $\lambda$, we select the black market exchange rate premium (characterizing trade
openness) and a measure of political instability. With a second relaxation of the choice of $\lambda$ we
select an additional set of educational attainment variables, and several others reported in
the table. With a third relaxation of $\lambda$ we include yet another set of variables also reported
in the table. We refer the reader to [1] and [2] for a complete definition and discussion of each
of these variables.

We then proceeded to apply ordinary median regression to the selected models and we
also report the standard confidence intervals for these estimates. Table 2 shows these results.
We should note that the confidence intervals do not take into account that we have selected
the models using the data. (In an ongoing companion work, we are working on devising
procedures that will account for this.) We find that in all models with additional selected
covariates, the median regression coefficient on the initial level of GDP is always negative
and the standard confidence intervals do not include zero. Similar conclusions also hold for
quantile regressions with quantile indices in the middle range. In summary, we believe that
our empirical findings support the hypothesis of convergence from the classical
Solow-Swan-Ramsey growth model. Of course, it would be good to find formal inferential methods to
fully support this hypothesis. Finally, our findings also agree with and thus support the previous
findings reported in Barro and Sala-i-Martin [1] and Koenker and Machado [22].

CONFIDENCE INTERVALS AFTER MODEL SELECTION FOR THE
INTERNATIONAL GROWTH REGRESSIONS

Penalization Parameter      Real GDP per capita (log)
($\lambda = 1.077968$)      Coefficient    90% Confidence Interval
$\lambda/2$                 -0.01691       [-0.02552, -0.00444]
$\lambda/3$                 -0.04121       [-0.05485, -0.02976]
$\lambda/4$                 -0.04466       [-0.06510, -0.03410]
$\lambda/5$                 -0.05148       [-0.06521, -0.03296]

Table 2

The table above displays the coefficient and a 90% confidence interval associated with each model selected by
the corresponding penalty parameter. The selected models are displayed in Table 3.

APPENDIX A: PROOF OF THEOREM 1 

Proof of Theorem 1. We note A < Wuraaxi<j<pSup^^KnEn [{u - l{ui < u})xij/aj]. 
For any u £ U, j £ {l,...,p} we have by Lemma 1.5 in [25] that P(|Gn[(n — Iju,, < 
— ^) — 2 exp(— i^^/2). Hence by the symmetrization lemma for probabilities. 
Lemma 2.3.7 in [37], with K > 2\/log 2 we have 
(A.l) 

P(A > K^\X) < 4P (sup„e^maxi<,<p |G°[(n - l{ui < n})x,,/(7,]| > K/i4Wu)\x) 
< 4pmaxi<j<p P [snp^^u |G° [(u - l{ui < u})x^j/aj]\ > K/{AWu)\x) 
MODEL SELECTION RESULTS FOR THE INTERNATIONAL GROWTH REGRESSIONS

Real GDP per capita (log) is included in all models. λ = 1.077968.

Penalization Parameter    Additional Selected Variables

λ                         (none)

λ/2                       Black Market Premium (log)
                          Political Instability

λ/3                       Black Market Premium (log)
                          Political Instability
                          Measure of tariff restriction
                          Infant mortality rate
                          Ratio of real government "consumption" net of defense and education
                          Exchange rate
                          % of "higher school complete" in female population
                          % of "secondary school complete" in male population

λ/4                       Black Market Premium (log)
                          Political Instability
                          Measure of tariff restriction
                          Infant mortality rate
                          Ratio of real government "consumption" net of defense and education
                          Exchange rate
                          % of "higher school complete" in female population
                          % of "secondary school complete" in male population
                          Female gross enrollment ratio for higher education
                          % of "no education" in the male population
                          Population proportion over 65
                          Average years of secondary schooling in the male population

λ/5                       Black Market Premium (log)
                          Political Instability
                          Measure of tariff restriction
                          Infant mortality rate
                          Ratio of real government "consumption" net of defense and education
                          Exchange rate
                          % of "higher school complete" in female population
                          % of "secondary school complete" in male population
                          Female gross enrollment ratio for higher education
                          % of "no education" in the male population
                          Population proportion over 65
                          Average years of secondary schooling in the male population
                          Growth rate of population
                          % of "higher school attained" in male population
                          Ratio of nominal government expenditure on defense to nominal GDP
                          Ratio of import to GDP

Table 3

For this particular decreasing sequence of penalization parameters we obtained nested models. 
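where $\mathbb G_n^o$ denotes the symmetrized empirical process (see [37]) generated by the Rademacher variables $\varepsilon_i$, $i = 1,\dots,n$, which are independent of $U = (u_1,\dots,u_n)$ and $X = (x_1,\dots,x_n)$. Let us condition on $U$ and $X$, and define $\mathcal F_j = \{\varepsilon_i x_{ij}(u - 1\{u_i \le u\})/\hat\sigma_j : u \in \mathcal U\}$ for $j = 1,\dots,p$. The VC dimension of $\mathcal F_j$ is at most 6. Therefore, by Theorem 2.6.7 of [37], for some universal constant $C_1' \ge 1$ the function class $\mathcal F_j$ with envelope function $F_j$ obeys

$$N(\varepsilon\|F_j\|_{\mathbb P_n,2}, \mathcal F_j, L_2(\mathbb P_n)) \le n(\varepsilon, \mathcal F_j) = C_1' \cdot 6 \cdot (16e)^6 (1/\varepsilon)^{10},$$

where $N(\varepsilon, \mathcal F, L_2(\mathbb P_n))$ denotes the minimal number of balls of radius $\varepsilon$ with respect to the $L_2(\mathbb P_n)$ norm $\|\cdot\|_{\mathbb P_n,2}$ needed to cover the class of functions $\mathcal F$; see [37].

Conditional on the data $U = (u_1,\dots,u_n)$ and $X = (x_1,\dots,x_n)$, the symmetrized empirical process $\{\mathbb G_n^o(f), f \in \mathcal F_j\}$ is sub-Gaussian with respect to the $L_2(\mathbb P_n)$ norm by the Hoeffding inequality; see, e.g., [37]. Since $\|F_j\|_{\mathbb P_n,2} \le 1$ and $\rho(\mathcal F_j, \mathbb P_n) \le 1$, we have

$$\|F_j\|_{\mathbb P_n,2} \int_0^{\rho(\mathcal F_j,\mathbb P_n)/4} \sqrt{\log n(\varepsilon, \mathcal F_j)}\,d\varepsilon \le \bar e := (1/4)\sqrt{\log(6C_1'(16e)^6)} + (1/4)\sqrt{10\log 4}.$$

By Lemma 14 with $D = 1$, there is a universal constant $c$ such that for any $K > 1$:

$$
\begin{aligned}
P\Big(\sup_{f\in\mathcal F_j} |\mathbb G_n^o(f)| > Kc\bar e \,\Big|\, X, U\Big) &\le \int_0^{1/2} \varepsilon^{-1} n(\varepsilon, \mathcal F_j)^{-(K^2-1)}\,d\varepsilon \\
&\le (1/2)\,[6C_1'(16e)^6]^{-(K^2-1)}\, \frac{(1/2)^{10(K^2-1)}}{5(K^2-1)}.
\end{aligned}
\tag{A.2}
$$

By (A.1) and (A.2) for any $k \ge 1$ we have

$$P\Big(\Lambda \ge k\cdot(4\sqrt2 c\bar e)\,W_{\mathcal U}\sqrt{n\log p} \,\Big|\, X\Big) \le 4p \max_{1\le j\le p} E_U P\Big(\sup_{f\in\mathcal F_j} |\mathbb G_n^o(f)| > k\sqrt{2\log p}\;c\bar e \,\Big|\, X, U\Big) \le p^{-k^2+1},$$

since $(2k^2\log p - 1) \ge (\log 2 - 0.5)k^2\log p$ for $p \ge 2$. Thus, result (i) holds with $C_\Lambda := 4\sqrt2 c\bar e$. Result (ii) follows immediately by choosing $k = \sqrt{1 + \log(1/\alpha)/\log p}$ to make the right side of the display above equal to $\alpha$. □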
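The pivotal quantity bounded in this proof can also be simulated directly. The following is a minimal sketch, assuming that $\Lambda$ is defined as $n \sup_{u} \max_j |\mathbb E_n[x_{ij}(u - 1\{u_i \le u\})]| / (\hat\sigma_j\sqrt{u(1-u)})$ with $u_i$ i.i.d. uniform on (0, 1), which is the form implied by the first display of the proof. The function names, the grid approximation of the supremum, and the number of replications are illustrative choices, not part of the paper.

```python
import math
import random


def simulate_pivotal_quantile(X, alpha, grid, reps=200, seed=0):
    """Simulate the (1 - alpha)-quantile of Lambda conditional on X.

    Assumed definition (hypothetical helper, see the lead-in):
    Lambda = n * sup_u max_j |E_n[x_ij (u - 1{u_i <= u})]| / (sigma_j sqrt(u(1-u))),
    with the supremum over u approximated on the finite `grid`.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # sigma_j^2 = E_n[x_ij^2], the empirical second moment of regressor j
    sigma = [math.sqrt(sum(X[i][j] ** 2 for i in range(n)) / n) for j in range(p)]
    draws = []
    for _ in range(reps):
        unif = [rng.random() for _ in range(n)]
        lam = 0.0
        for u in grid:
            scores = [u - (1.0 if ui <= u else 0.0) for ui in unif]
            scale = math.sqrt(u * (1.0 - u))
            for j in range(p):
                s = sum(X[i][j] * scores[i] for i in range(n))  # equals n * E_n[...]
                lam = max(lam, abs(s) / (sigma[j] * scale))
        draws.append(lam)
    draws.sort()
    k = min(reps - 1, math.ceil((1.0 - alpha) * reps) - 1)
    return draws[max(k, 0)]


def k_for_level(alpha, p):
    """The choice k = sqrt(1 + log(1/alpha) / log p) used for result (ii)."""
    return math.sqrt(1.0 + math.log(1.0 / alpha) / math.log(p))
```

With `k = k_for_level(alpha, p)`, the bound $p^{-k^2+1}$ equals `alpha`, which is the calibration used in the last step of the proof.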

APPENDIX B: PROOFS OF LEMMAS 1-3 (USED IN THEOREM 2) 

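Proof of Lemma 1. (Restricted Set) Part 1. By condition D.3, with probability $1 - \gamma$, for every $j = 1,\dots,p$ we have $1/2 \le \hat\sigma_j \le 3/2$, which implies (3.1).

Part 2. Denote the true rank scores by $a_i^*(u) = u - 1\{y_i \le x_i'\beta(u)\}$ for $i = 1,\dots,n$. Next recall that $\hat Q_u(\cdot)$ is a convex function and $\mathbb E_n[x_i a_i^*(u)] \in \partial\hat Q_u(\beta(u))$. Therefore, we have

$$\hat Q_u(\hat\beta(u)) \ge \hat Q_u(\beta(u)) + \mathbb E_n[x_i a_i^*(u)]'(\hat\beta(u) - \beta(u)).$$

Let $D = \mathrm{diag}[\hat\sigma_1,\dots,\hat\sigma_p]$ and note that $\lambda\sqrt{u(1-u)}\,(c_0 - 3)/(c_0 + 3) \ge n\|D^{-1}\mathbb E_n[x_i a_i^*(u)]\|_\infty$ with probability at least $1 - \alpha$. By optimality of $\hat\beta(u)$ for the $\ell_1$-penalized problem, we have

$$
\begin{aligned}
0 &\le \hat Q_u(\beta(u)) - \hat Q_u(\hat\beta(u)) + \frac{\lambda\sqrt{u(1-u)}}{n}\|\beta(u)\|_{1,n} - \frac{\lambda\sqrt{u(1-u)}}{n}\|\hat\beta(u)\|_{1,n} \\
&\le \big|(D^{-1}\mathbb E_n[x_i a_i^*(u)])' D(\hat\beta(u) - \beta(u))\big| + \frac{\lambda\sqrt{u(1-u)}}{n}\big(\|\beta(u)\|_{1,n} - \|\hat\beta(u)\|_{1,n}\big) \\
&\le \frac{\lambda\sqrt{u(1-u)}}{n} \sum_{j=1}^p \hat\sigma_j\Big(\frac{c_0-3}{c_0+3}\,|\hat\beta_j(u) - \beta_j(u)| + |\beta_j(u)| - |\hat\beta_j(u)|\Big),
\end{aligned}
$$

with probability at least $1 - \alpha$. After canceling $\lambda\sqrt{u(1-u)}/n$ we obtain

$$\Big(1 - \frac{c_0 - 3}{c_0 + 3}\Big)\|\hat\beta(u) - \beta(u)\|_{1,n} \le \sum_{j=1}^p \hat\sigma_j\big(|\hat\beta_j(u) - \beta_j(u)| + |\beta_j(u)| - |\hat\beta_j(u)|\big). \tag{B.1}$$

Furthermore, since $|\hat\beta_j(u) - \beta_j(u)| + |\beta_j(u)| - |\hat\beta_j(u)| = 0$ if $\beta_j(u) = 0$, i.e. $j \in T_u^c$,

$$\sum_{j=1}^p \hat\sigma_j\big(|\hat\beta_j(u) - \beta_j(u)| + |\beta_j(u)| - |\hat\beta_j(u)|\big) \le 2\|\hat\beta_{T_u}(u) - \beta(u)\|_{1,n}. \tag{B.2}$$

(B.1) and (B.2) establish that $\|\hat\beta_{T_u^c}(u)\|_{1,n} \le (c_0/3)\|\hat\beta_{T_u}(u) - \beta(u)\|_{1,n}$ with probability at least $1 - \alpha$. In turn, by Part 1 of this Lemma, $\|\hat\beta_{T_u^c}(u)\|_{1,n} \ge (1/2)\|\hat\beta_{T_u^c}(u)\|_1$ and $\|\hat\beta_{T_u}(u) - \beta(u)\|_{1,n} \le (3/2)\|\hat\beta_{T_u}(u) - \beta(u)\|_1$, which holds with probability at least $1 - \gamma$. The intersection of these two events holds with probability at least $1 - \alpha - \gamma$. Finally, by Lemma 7, $\|\hat\beta(u)\|_0 \le n$ with probability 1 uniformly in $u \in \mathcal U$. □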



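Proof of Lemma 2. (Identification in Population) Part 1. Proof of claims (3.3)-(3.5). By RE($c_0$, $m$) and by $\delta \in A_u$

$$\|J_u^{1/2}\delta\| \ge \|(E[x_ix_i'])^{1/2}\delta\|\,\underline f^{1/2} \ge \|\delta_{T_u}\|\,\underline f^{1/2}\kappa_0 \ge \frac{\underline f^{1/2}\kappa_0}{\sqrt s}\|\delta_{T_u}\|_1 \ge \frac{\underline f^{1/2}\kappa_0}{\sqrt s\,(1 + c_0)}\|\delta\|_1.$$

Part 2. Proof of claim (3.6). Proceeding similarly to [7], we note that the $k$th largest in absolute value component of $\delta_{T_u^c}$ is less than $\|\delta_{T_u^c}\|_1/k$. Therefore by $\delta \in A_u$ and $|T_u| \le s$

$$\|\delta_{(T_u\cup\overline T_u(\delta,m))^c}\|^2 \le \sum_{k\ge m+1}\frac{\|\delta_{T_u^c}\|_1^2}{k^2} \le \frac{\|\delta_{T_u^c}\|_1^2}{m} \le c_0^2\frac{\|\delta_{T_u}\|_1^2}{m} \le c_0^2\|\delta_{T_u}\|^2\frac{s}{m} \le c_0^2\|\delta_{T_u\cup\overline T_u(\delta,m)}\|^2\frac{s}{m},$$

so that $\|\delta\| \le (1 + c_0\sqrt{s/m})\,\|\delta_{T_u\cup\overline T_u(\delta,m)}\|$; and the last term is bounded by RE($c_0$, $m$).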
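Part 3. Proof of claim (3.7) proceeds in two steps. Step 1. (Minoration). Define the maximal radius over which the criterion function can be minorated by a quadratic function

$$r_{A_u} = \sup\Big\{r : Q_u(\beta(u) + \tilde\delta) - Q_u(\beta(u)) \ge \tfrac14\|J_u^{1/2}\tilde\delta\|^2, \text{ for all } \tilde\delta \in A_u,\ \|J_u^{1/2}\tilde\delta\| \le r\Big\}.$$

Step 2 below shows that $r_{A_u} \ge 4q$. By construction of $r_{A_u}$ and the convexity of $Q_u$,

$$Q_u(\beta(u) + \delta) - Q_u(\beta(u)) \ge \frac{\|J_u^{1/2}\delta\|^2}{4} \wedge \Big\{\frac{\|J_u^{1/2}\delta\|}{r_{A_u}}\cdot\frac{r_{A_u}^2}{4}\Big\} \ge \frac{\|J_u^{1/2}\delta\|^2}{4} \wedge \big\{q\|J_u^{1/2}\delta\|\big\}, \quad\text{for any } \delta \in A_u.$$

Step 2. ($r_{A_u} \ge 4q$) Let $F_{y|x}$ denote the conditional distribution of $y$ given $x$. From [18], for any two scalars $w$ and $v$ we have that

$$\rho_u(w - v) - \rho_u(w) = -v(u - 1\{w \le 0\}) + \int_0^v (1\{w \le z\} - 1\{w \le 0\})\,dz. \tag{B.3}$$

Using (B.3) with $w = y - x'\beta(u)$ and $v = x'\delta$ we conclude $E[-v(u - 1\{w \le 0\})] = 0$. Using the law of iterated expectations and mean value expansion, we obtain for $\tilde z_{x,z} \in [0, z]$

$$
\begin{aligned}
Q_u(\beta(u) + \delta) - Q_u(\beta(u)) &= E\Big[\int_0^{x'\delta} F_{y|x}(x'\beta(u) + z) - F_{y|x}(x'\beta(u))\,dz\Big] \\
&= E\Big[\int_0^{x'\delta} z f_{y|x}(x'\beta(u)) + \frac{z^2}{2} f'_{y|x}(x'\beta(u) + \tilde z_{x,z})\,dz\Big] \\
&\ge \tfrac12\|J_u^{1/2}\delta\|^2 - \tfrac16\bar f' E[|x'\delta|^3] \ge \tfrac14\|J_u^{1/2}\delta\|^2 + \tfrac14\underline f E[|x'\delta|^2] - \tfrac16\bar f' E[|x'\delta|^3].
\end{aligned}
\tag{B.4}
$$

Note that for $\delta \in A_u$, if $\|J_u^{1/2}\delta\| \le 4q \le (3/2)\cdot(\underline f^{3/2}/\bar f')\cdot\inf_{\delta\in A_u,\delta\ne0} E[|x'\delta|^2]^{3/2}/E[|x'\delta|^3]$, it follows that $(1/6)\bar f' E[|x'\delta|^3] \le (1/4)\underline f E[|x'\delta|^2]$. This and (B.4) imply $r_{A_u} \ge 4q$. □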
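Identity (B.3) from [18], used in Step 2 above, can be checked numerically. The sketch below assumes the check function $\rho_u(t) = t(u - 1\{t \le 0\})$ and evaluates the integral in closed form; the function names are illustrative.

```python
import random


def check(t, u):
    """Check function rho_u(t) = t * (u - 1{t <= 0})."""
    return t * (u - (1.0 if t <= 0 else 0.0))


def knight_rhs(w, v, u):
    """Right-hand side of (B.3), with the integral evaluated in closed form."""
    ind0 = 1.0 if w <= 0 else 0.0
    if v >= 0:
        # integral over [0, v] of 1{w <= z}: length of the interval [max(w, 0), v]
        integral = max(0.0, v - max(w, 0.0))
    else:
        # for v < 0 the integral over [0, v] is minus the integral over [v, 0]
        integral = -max(0.0, 0.0 - max(w, v))
    return -v * (u - ind0) + integral - v * ind0


def max_identity_gap(trials=10000, seed=0):
    """Largest absolute gap between both sides of (B.3) over random draws."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        w, v, u = rng.uniform(-3, 3), rng.uniform(-3, 3), rng.uniform(0.01, 0.99)
        lhs = check(w - v, u) - check(w, u)
        gap = max(gap, abs(lhs - knight_rhs(w, v, u)))
    return gap
```

The gap returned by `max_identity_gap()` should be at the level of floating-point rounding error.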

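Proof of Lemma 3. (Control of Empirical Error) We divide the proof in four steps.

Step 1. (Main Argument) Let

$$\mathcal A(t) := \epsilon(t)\sqrt n = \sup_{u\in\mathcal U,\ \|J_u^{1/2}\delta\|\le t,\ \delta\in A_u} \big|\mathbb G_n[\rho_u(y_i - x_i'(\beta(u) + \delta)) - \rho_u(y_i - x_i'\beta(u))]\big|.$$

Let $\Omega_1$ be the event in which $\max_{1\le j\le p}|\hat\sigma_j - 1| \le 1/2$, where $P(\Omega_1) \ge 1 - \gamma$.

In order to apply the symmetrization lemma, Lemma 2.3.7 in [37], to bound the tail probability of $\mathcal A(t)$, first note that for any fixed $\delta \in A_u$, $u \in \mathcal U$ we have

$$\mathrm{var}\big(\mathbb G_n[\rho_u(y_i - x_i'(\beta(u) + \delta)) - \rho_u(y_i - x_i'\beta(u))]\big) \le E[(x_i'\delta)^2] \le t^2/\underline f.$$

Then application of the symmetrization lemma for probabilities, Lemma 2.3.7 in [37], yields

$$P(\mathcal A(t) \ge M) \le \frac{2P(\mathcal A^o(t) \ge M/4)}{1 - t^2/(\underline f M^2)} \le \frac{2P(\mathcal A^o(t) \ge M/4 \mid \Omega_1) + 2P(\Omega_1^c)}{1 - t^2/(\underline f M^2)}, \tag{B.5}$$

where $\mathcal A^o(t)$ is the symmetrized version of $\mathcal A(t)$, constructed by replacing the empirical process $\mathbb G_n$ with its symmetrized version $\mathbb G_n^o$, and $P(\Omega_1^c) \le \gamma$. We set $M \ge M_1 := t(3/\underline f)^{1/2}$, which makes the denominator on the right side of (B.5) greater than 2/3. Further, Step 2 below shows that $P(\mathcal A^o(t) \ge M/4 \mid \Omega_1) \le p^{-A^2}$ for

$$M/4 \ge M_2 := t\cdot A\cdot 18\sqrt2\cdot\Gamma\cdot\sqrt{2\log p + \log(2 + 4\sqrt2 L\underline f^{1/2}\kappa_0/t)}, \qquad \Gamma = \sqrt s\,(1 + c_0)/[\underline f^{1/2}\kappa_0].$$

We conclude that with probability at least $1 - 3\gamma - 3p^{-A^2}$, $\mathcal A(t) \le M_1 \vee (4M_2)$.

Therefore, there is a universal constant $C_E$ such that with probability at least $1 - 3\gamma - 3p^{-A^2}$,

$$\mathcal A(t) \le t\cdot C_E\cdot\frac{(1 + c_0)A}{\underline f^{1/2}\kappa_0}\sqrt{s\log(p \vee [L\underline f^{1/2}\kappa_0/t])}$$

and the result follows.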

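Step 2. (Bound on $P(\mathcal A^o(t) > K \mid \Omega_1)$.) We begin by noting that Lemmas 1 and 2 imply that $\|\delta\|_{1,n} \le (3/2)\sqrt s\,(1 + c_0)\|J_u^{1/2}\delta\|/[\underline f^{1/2}\kappa_0]$ so that for all $u \in \mathcal U$

$$\{\delta \in A_u : \|J_u^{1/2}\delta\| \le t\} \subseteq \{\delta \in \mathbb R^p : \|\delta\|_{1,n} \le 2t\Gamma\}, \qquad \Gamma := \sqrt s\,(1 + c_0)/[\underline f^{1/2}\kappa_0]. \tag{B.6}$$

Further, we let $\mathcal U_k = \{u_1,\dots,u_k\}$ be an $\varepsilon$-net of quantile indices in $\mathcal U$ with

$$\varepsilon \le t\Gamma/(2\sqrt{2s}\,L) \quad\text{and}\quad k \le 1/\varepsilon. \tag{B.7}$$

By $\rho_u(y_i - x_i'(\beta(u) + \delta)) - \rho_u(y_i - x_i'\beta(u)) = -u x_i'\delta + w_i(x_i'\delta, u)$, for $w_i(b, u) := (y_i - x_i'\beta(u) - b)_- - (y_i - x_i'\beta(u))_-$, and by (B.6) we have that $\mathcal A^o(t) \le \mathcal B^o(t) + \mathcal C^o(t)$, where

$$\mathcal B^o(t) := \sup_{u\in\mathcal U,\ \|\delta\|_{1,n}\le 2t\Gamma} |\mathbb G_n^o[x_i'\delta]| \quad\text{and}\quad \mathcal C^o(t) := \sup_{u\in\mathcal U,\ \|\delta\|_{1,n}\le 2t\Gamma} |\mathbb G_n^o[w_i(x_i'\delta, u)]|.$$

Then we compute the bounds

$$
\begin{aligned}
P[\mathcal B^o(t) > K \mid \Omega_1] &\le \min_{\lambda\ge0} e^{-\lambda K} E[e^{\lambda\mathcal B^o(t)} \mid \Omega_1] && \text{by Markov} \\
&\le \min_{\lambda\ge0} e^{-\lambda K}\, 2p\exp((2\lambda t\Gamma)^2/2) && \text{by Step 3} \\
&\le 2p\exp(-K^2/(2\sqrt2 t\Gamma)^2) && \text{by setting } \lambda = K/(2t\Gamma)^2, \\
P[\mathcal C^o(t) > K \mid \Omega_1] &\le \min_{\lambda\ge0} e^{-\lambda K} E[e^{\lambda\mathcal C^o(t)} \mid \Omega_1, X] && \text{by Markov} \\
&\le \min_{\lambda\ge0} \exp(-\lambda K)\, 2(p/\varepsilon)\exp((16\lambda t\Gamma)^2/2) && \text{by Step 4} \\
&\le \varepsilon^{-1} 2p\exp(-K^2/(16\sqrt2 t\Gamma)^2) && \text{by setting } \lambda = K/(16t\Gamma)^2,
\end{aligned}
$$

so that

$$P[\mathcal A^o(t) > 2\sqrt2 K + 16\sqrt2 K \mid \Omega_1] \le P[\mathcal B^o(t) > 2\sqrt2 K \mid \Omega_1] + P[\mathcal C^o(t) > 16\sqrt2 K \mid \Omega_1] \le 2p(1 + \varepsilon^{-1})\exp(-K^2/(t\Gamma)^2).$$

Setting $K = A\cdot t\cdot\Gamma\cdot\sqrt{\log\{2p^2(1 + \varepsilon^{-1})\}}$, for $A \ge 1$, we get $P[\mathcal A^o(t) \ge 18\sqrt2 K \mid \Omega_1] \le p^{-A^2}$.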
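Step 3. (Bound on $E[e^{\lambda\mathcal B^o(t)} \mid \Omega_1]$.) We bound

$$
\begin{aligned}
E[e^{\lambda\mathcal B^o(t)} \mid \Omega_1] &\le E\big[\exp\big(2\lambda t\Gamma\max_{j\le p}|\mathbb G_n^o(x_{ij})/\hat\sigma_j|\big) \mid \Omega_1\big] \\
&\le 2p\max_{j\le p} E[\exp(2\lambda t\Gamma\,\mathbb G_n^o(x_{ij})/\hat\sigma_j) \mid \Omega_1] \le 2p\exp((2\lambda t\Gamma)^2/2),
\end{aligned}
$$

where the first inequality follows from $|\mathbb G_n^o[x_i'\delta]| \le \|\delta\|_{1,n}\max_{1\le j\le p}|\mathbb G_n^o(x_{ij})/\hat\sigma_j|$ holding under event $\Omega_1$, the penultimate inequality follows from the simple bound

$$E[\max_{j\le p} e^{|z_j|}] \le p\max_{j\le p} E[e^{|z_j|}] \le p\max_{j\le p} E[e^{z_j} + e^{-z_j}] \le 2p\max_{j\le p} E[e^{z_j}]$$

holding for symmetric random variables $z_j$, and the last inequality follows from the law of iterated expectations and from $E[\exp(2\lambda t\Gamma\,\mathbb G_n^o(x_{ij})/\hat\sigma_j) \mid \Omega_1, X] \le \exp((2\lambda t\Gamma)^2/2)$ holding by the Hoeffding inequality (more precisely, by the intermediate step in the proof of the Hoeffding inequality, see, e.g., p. 100 in [37]). Here $E[\cdot \mid \Omega_1, X]$ denotes the expectation over the symmetrizing Rademacher variables entering the definition of the symmetrized process $\mathbb G_n^o$.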
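The Hoeffding-type moment generating function bound used in Step 3 can be evaluated exactly for a fixed regressor. The sketch below assumes the symmetrized process has the form $\mathbb G_n^o(x_{ij}) = n^{-1/2}\sum_i \varepsilon_i x_{ij}$ with i.i.d. Rademacher $\varepsilon_i$ and $\hat\sigma_j^2 = \mathbb E_n[x_{ij}^2]$; under that assumption the expectation over the signs is a product of hyperbolic cosines. The function name is illustrative.

```python
import math
import random


def rademacher_mgf(x, lam):
    """Exact E[exp(lam * n^{-1/2} * sum_i eps_i x_i / sigma)] over Rademacher eps_i.

    Independence of the eps_i turns the expectation into a product of cosh terms.
    sigma^2 = E_n[x_i^2] is the empirical second moment, so sum_i (x_i / sigma)^2 = n.
    """
    n = len(x)
    sigma = math.sqrt(sum(xi * xi for xi in x) / n)
    log_mgf = sum(math.log(math.cosh(lam * xi / (sigma * math.sqrt(n)))) for xi in x)
    return math.exp(log_mgf)


def hoeffding_bound(lam):
    """The sub-Gaussian bound exp(lam^2 / 2) on the moment generating function."""
    return math.exp(lam * lam / 2.0)
```

Since $\cosh(s) \le \exp(s^2/2)$ for every real $s$, `rademacher_mgf(x, lam)` never exceeds `hoeffding_bound(lam)`, which is the inequality invoked with `lam` equal to $2\lambda t\Gamma$.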

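Step 4. (Bound on $E[e^{\lambda\mathcal C^o(t)} \mid \Omega_1]$.) We bound

$$
\begin{aligned}
\mathcal C^o(t) &\le \sup_{u\in\mathcal U,\ |u-\tilde u|\le\varepsilon,\ \tilde u\in\mathcal U_k,\ \|\delta\|_{1,n}\le2t\Gamma} \big|\mathbb G_n^o[w_i(x_i'(\delta + \beta(u) - \beta(\tilde u)), \tilde u)]\big| \\
&\quad + \sup_{u\in\mathcal U,\ |u-\tilde u|\le\varepsilon,\ \tilde u\in\mathcal U_k} \big|\mathbb G_n^o[w_i(x_i'(\beta(u) - \beta(\tilde u)), \tilde u)]\big| \\
&\le 2\sup_{\tilde u\in\mathcal U_k,\ \|\delta\|_{1,n}\le4t\Gamma} \big|\mathbb G_n^o[w_i(x_i'\delta, \tilde u)]\big| =: \mathcal D^o(t),
\end{aligned}
$$

where the first inequality is elementary, and the second inequality follows from the inequality

$$\sup_{|u-\tilde u|\le\varepsilon} \|\beta(u) - \beta(\tilde u)\|_{1,n} \le \sqrt{2s}\,L\,(2\max_{1\le j\le p}\hat\sigma_j)\,\varepsilon \le \sqrt{2s}\,L\,(2\cdot3/2)\,\varepsilon \le 2t\Gamma,$$

holding by our choice (B.7) of $\varepsilon$ and by event $\Omega_1$.

Next we bound $E[e^{\lambda\mathcal D^o(t)} \mid \Omega_1]$:

$$
\begin{aligned}
E[e^{\lambda\mathcal D^o(t)} \mid \Omega_1] &\le (1/\varepsilon)\max_{\tilde u\in\mathcal U_k} E\Big[\exp\Big(2\lambda\sup_{\|\delta\|_{1,n}\le4t\Gamma}\big|\mathbb G_n^o[w_i(x_i'\delta, \tilde u)]\big|\Big) \,\Big|\, \Omega_1\Big] \\
&\le (1/\varepsilon)\max_{\tilde u\in\mathcal U_k} E\Big[\exp\Big(4\lambda\sup_{\|\delta\|_{1,n}\le4t\Gamma}\big|\mathbb G_n^o[x_i'\delta]\big|\Big) \,\Big|\, \Omega_1\Big] \\
&\le 2(p/\varepsilon)\max_{j\le p} E[\exp(16\lambda t\Gamma\,\mathbb G_n^o(x_{ij})/\hat\sigma_j) \mid \Omega_1] \le 2(p/\varepsilon)\exp((16\lambda t\Gamma)^2/2),
\end{aligned}
$$

where the first inequality follows from the definition of $w_i$ and by $k \le 1/\varepsilon$, the second inequality follows from the exponential moment inequality for contractions (Theorem 4.12 of Ledoux and Talagrand [25]) and from the contractive property $|w_i(a, u) - w_i(b, u)| \le |a - b|$, and the last two inequalities follow exactly as in Step 3. □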

APPENDIX C: PROOF OF LEMMAS 4-5 (USED IN THEOREM 3) 

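In order to characterize the sparsity properties of $\hat\beta(u)$, we will exploit the fact that (2.4) can be written as the following linear programming problem:

$$
\begin{aligned}
\min_{(\xi^+,\xi^-,\beta^+,\beta^-)\in\mathbb R_+^{2n+2p}}\ & \mathbb E_n[u\xi_i^+ + (1-u)\xi_i^-] + \frac{\lambda\sqrt{u(1-u)}}{n}\sum_{j=1}^p \hat\sigma_j(\beta_j^+ + \beta_j^-) \\
& \xi_i^+ - \xi_i^- = y_i - x_i'(\beta^+ - \beta^-), \quad i = 1,\dots,n.
\end{aligned}
\tag{C.1}
$$

Our theoretical analysis of the sparsity of $\hat\beta(u)$ relies on the dual of (C.1):

$$
\begin{aligned}
\max_{a\in\mathbb R^n}\ & \mathbb E_n[y_ia_i] \\
& |\mathbb E_n[x_{ij}a_i]| \le \lambda\sqrt{u(1-u)}\,\hat\sigma_j/n, \quad j = 1,\dots,p, \\
& (u - 1) \le a_i \le u, \quad i = 1,\dots,n.
\end{aligned}
\tag{C.2}
$$

The dual program maximizes the correlation between the response variable and the rank scores subject to the condition requiring the rank scores to be approximately uncorrelated with the regressors. The optimal solution $\hat a(u)$ to (C.2) plays a key role in determining the sparsity of $\hat\beta(u)$.

Lemma 7 (Signs and Interpolation Property). (1) For any $j \in \{1,\dots,p\}$

$$
\begin{aligned}
\hat\beta_j(u) > 0 &\ \text{ iff }\ \mathbb E_n[x_{ij}\hat a_i(u)] = \lambda\sqrt{u(1-u)}\,\hat\sigma_j/n, \\
\hat\beta_j(u) < 0 &\ \text{ iff }\ \mathbb E_n[x_{ij}\hat a_i(u)] = -\lambda\sqrt{u(1-u)}\,\hat\sigma_j/n.
\end{aligned}
\tag{C.3}
$$

(2) $\|\hat\beta(u)\|_0 \le n \wedge p$ uniformly over $u \in \mathcal U$. (3) If $y_1,\dots,y_n$ are absolutely continuous conditional on $x_1,\dots,x_n$, then the number of interpolated data points, $I_u = |\{i : y_i = x_i'\hat\beta(u)\}|$, is equal to $\|\hat\beta(u)\|_0$ with probability one uniformly over $u \in \mathcal U$.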
Proof of Lemma 7. Step 1. Part (1) follows from the complementary slackness condition for linear programming problems, see Theorem 4.5 of [6].

Step 2. To show part (2) consider any $u \in \mathcal U$. Trivially we have $\|\hat\beta(u)\|_0 \le p$. Let $Y = (y_1,\dots,y_n)'$, $\hat\sigma = (\hat\sigma_1,\dots,\hat\sigma_p)'$, $X$ be the $n \times p$ matrix with rows $x_i'$, $i = 1,\dots,n$, $c_u = (ue', (1-u)e', \lambda\sqrt{u(1-u)}\,\hat\sigma', \lambda\sqrt{u(1-u)}\,\hat\sigma')'$, and $A = [I\ \ {-I}\ \ X\ \ {-X}]$, where $e = (1,1,\dots,1)'$ denotes an $n$-vector of ones, and $I$ denotes the $n \times n$ identity matrix. For $w = (\xi^+, \xi^-, \beta^+, \beta^-)$, the primal problem (C.1) can be written as $\min_w\{c_u'w : Aw = Y, w \ge 0\}$. Matrix $A$ has rank $n$, since it has linearly independent rows. By Theorem 2.4 of [6] there is at least one optimal basic solution $\hat w(u) = (\hat\xi^+(u), \hat\xi^-(u), \hat\beta^+(u), \hat\beta^-(u))$, and all basic solutions have at most $n$ non-zero components. Since $\hat\beta(u) = \hat\beta^+(u) - \hat\beta^-(u)$, $\hat\beta(u)$ has at most $n$ non-zero components.

Let $I_u$ denote the number of interpolated points in (2.4) at the quantile index $u$. We have that $n - I_u$ components of $\hat\xi^+(u)$ and $\hat\xi^-(u)$ are non-zero. Therefore, $\|\hat\beta(u)\|_0 + (n - I_u) \le n$, which leads to $\|\hat\beta(u)\|_0 \le I_u$. By Step 3 below this holds with equality with probability 1 uniformly over $u \in \mathcal U$, thus establishing part (3).

Step 3. Consider the dual problem $\max_a\{Y'a : A'a \le c_u\}$ for all $u \in \mathcal U$. Conditional on $X$ the feasible region of this problem is the polytope $R_u = \{a : A'a \le c_u\}$. Since $c_u > 0$, $R_u$ is non-empty for all $u \in \mathcal U$. Moreover, the form of $A'$ implies that $R_u \subset [-1, 1]^n$ so $R_u$ is bounded. Therefore, if the solution of the dual is not unique for some $u \in \mathcal U$ there exist vertices $a^1, a^2$ connected by an edge of $R_u$ such that $Y'(a^1 - a^2) = 0$. Note that the matrix $A'$ is the same for all $u \in \mathcal U$ so that the direction $a^1 - a^2$ of the edge linking $a^1$ and $a^2$ is generated by a finite number of intersections of hyperplanes associated with the rows of $A'$. Thus, the event $Y'(a^1 - a^2) = 0$ is a zero probability event uniformly in $u \in \mathcal U$ since $Y$ is absolutely continuous conditional on $X$ and the number of different edge directions is finite. Therefore the dual problem has a unique solution with probability one uniformly in $u \in \mathcal U$. If the dual basic solution is unique, we have that the primal basic solution is non-degenerate, that is, the number of non-zero variables equals $n$, see [6]. Therefore, with probability one $\|\hat\beta(u)\|_0 + (n - I_u) = n$, or $\|\hat\beta(u)\|_0 = I_u$ for all $u \in \mathcal U$. □
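The standard-form representation $\min_w\{c_u'w : Aw = Y, w \ge 0\}$ used in Step 2 can be illustrated without an LP solver: any coefficient vector maps to a feasible point $w$ whose cost equals $n$ times the penalized criterion. The sketch below assumes that criterion (2.4) is the average check loss plus $(\lambda\sqrt{u(1-u)}/n)\sum_j\hat\sigma_j|\beta_j|$, which is the form consistent with $c_u$ above; all function names are illustrative.

```python
import math


def check(t, u):
    """Check function rho_u(t) = t * (u - 1{t <= 0})."""
    return t * (u - (1.0 if t <= 0 else 0.0))


def residuals(y, X, beta):
    return [yi - sum(xij * bj for xij, bj in zip(xi, beta)) for yi, xi in zip(y, X)]


def penalized_loss(y, X, beta, u, lam, sigma):
    """Assumed criterion: E_n[rho_u(y - x'beta)] + (lam sqrt(u(1-u)) / n) sum_j sigma_j |beta_j|."""
    n = len(y)
    fit = sum(check(r, u) for r in residuals(y, X, beta)) / n
    pen = lam * math.sqrt(u * (1.0 - u)) / n * sum(s * abs(b) for s, b in zip(sigma, beta))
    return fit + pen


def lp_point_and_cost(y, X, beta, u, lam, sigma):
    """Map beta to w = (xi+, xi-, beta+, beta-) >= 0 and build the cost vector c_u."""
    n = len(y)
    res = residuals(y, X, beta)
    w = ([max(r, 0.0) for r in res] + [max(-r, 0.0) for r in res]
         + [max(b, 0.0) for b in beta] + [max(-b, 0.0) for b in beta])
    pen = lam * math.sqrt(u * (1.0 - u))
    c = [u] * n + [1.0 - u] * n + [pen * s for s in sigma] * 2
    return w, c
```

For any `beta`, the inner product of `c` and `w` divided by `n` reproduces `penalized_loss`, so minimizing the linear program over feasible `w` is equivalent to minimizing the penalized criterion.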

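Proof of Lemma 4. (Empirical Pre-Sparsity) That $\hat s \le n \wedge p$ follows from Lemma 7. We proceed to show the last bound.

Let $\hat a(u)$ be the solution of the dual problem (C.2), $\hat T_u = \mathrm{support}(\hat\beta(u))$, and $\hat s_u = \|\hat\beta(u)\|_0 = |\hat T_u|$. For any $j \in \hat T_u$, from (C.3) we have $(X'\hat a(u))_j = \mathrm{sign}(\hat\beta_j(u))\lambda\hat\sigma_j\sqrt{u(1-u)}$ and, for $j \notin \hat T_u$ we have $\mathrm{sign}(\hat\beta_j(u)) = 0$. Therefore, by the Cauchy-Schwarz inequality, and by D.3, with probability $1 - \gamma$ we have

$$
\begin{aligned}
\hat s_u\lambda = \mathrm{sign}(\hat\beta(u))'\mathrm{sign}(\hat\beta(u))\lambda &\le \mathrm{sign}(\hat\beta(u))'(X'\hat a(u))\big/\big[\min_{j=1,\dots,p}\hat\sigma_j\sqrt{u(1-u)}\big] \\
&\le 2\|X\mathrm{sign}(\hat\beta(u))\|\,\|\hat a(u)\|/\sqrt{u(1-u)} \\
&\le 2\sqrt{n\phi(\hat s_u)}\,\|\mathrm{sign}(\hat\beta(u))\|\,\|\hat a(u)\|/\sqrt{u(1-u)},
\end{aligned}
$$

where we used that $\|\mathrm{sign}(\hat\beta(u))\|_0 = \hat s_u$ and $\min_{1\le j\le p}\hat\sigma_j \ge 1/2$ with probability $1 - \gamma$. Since $\|\hat a(u)\| \le \sqrt n$, and $\|\mathrm{sign}(\hat\beta(u))\| = \sqrt{\hat s_u}$ we have $\hat s_u\lambda \le 2n\sqrt{\hat s_u\phi(\hat s_u)}\,W_{\mathcal U}$. Taking the supremum over $u \in \mathcal U$ on both sides yields the first result.

To establish the second result, note that $\hat s \le \hat m = \max\{m : m \le n \wedge p \wedge 4n^2\phi(m)W_{\mathcal U}^2/\lambda^2\}$. Suppose that $\hat m > m_0 = n/\log(n \vee p)$, so that $\hat m = m_0\ell$ for some $\ell > 1$, since $\hat m \le n$ is finite. By definition, $\hat m$ satisfies $\hat m \le 4n^2\phi(\hat m)W_{\mathcal U}^2/\lambda^2$. Insert the lower bound on $\lambda$, $m_0$, and $\hat m = m_0\ell$ in this inequality, and using Lemma 11 we obtain:

$$\hat m = m_0\ell \le \frac{4n^2W_{\mathcal U}^2}{8W_{\mathcal U}^2\,n\log(n \vee p)}\cdot\frac{\phi(m_0\ell)}{\phi(m_0)} \le \frac{n}{2\log(n \vee p)}\lceil\ell\rceil < \frac{n}{\log(n \vee p)}\,\ell = m_0\ell,$$

which is a contradiction. □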



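Proof of Lemma 5. (Empirical Sparsity) It is convenient to define:

1. the true rank scores, $a_i^*(u) = u - 1\{y_i \le x_i'\beta(u)\}$ for $i = 1,\dots,n$;

2. the estimated rank scores, $a_i(u) = u - 1\{y_i \le x_i'\hat\beta(u)\}$ for $i = 1,\dots,n$;

3. the dual optimal rank scores, $\hat a(u)$, that solve the dual program (C.2).

Let $\hat T_u$ denote the support of $\hat\beta(u)$, and $\hat s_u = \|\hat\beta(u)\|_0$. Let $\tilde x_{i\hat T_u} = (x_{ij}/\hat\sigma_j, j \in \hat T_u)'$, and $\hat\beta_{\hat T_u}(u) = (\hat\beta_j(u), j \in \hat T_u)'$. From the complementary slackness characterizations (C.3)

$$\sqrt{\hat s_u} = \|\mathrm{sign}(\hat\beta_{\hat T_u}(u))\| = \frac{\big\|n\mathbb E_n[\tilde x_{i\hat T_u}\hat a_i(u)]\big\|}{\lambda\sqrt{u(1-u)}}. \tag{C.4}$$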

Therefore we can bound the number of non-zero components of (3{u) provided we can 
bound the empirical expectation in (C.4). This is achieved in the next step by combining the 
maximal inequalities and assumptions on the design matrix. 
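Using the triangle inequality in (C.4), write, for $\hat s = \sup_{u\in\mathcal U}\hat s_u$,

$$\lambda\sqrt{\hat s} \le \sup_{u\in\mathcal U}\frac{\big\|n\mathbb E_n[\tilde x_{i\hat T_u}(\hat a_i(u) - a_i(u))]\big\| + \big\|n\mathbb E_n[\tilde x_{i\hat T_u}(a_i(u) - a_i^*(u))]\big\| + \big\|n\mathbb E_n[\tilde x_{i\hat T_u}a_i^*(u)]\big\|}{\sqrt{u(1-u)}}.$$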
This leads to the inequality



xVf < 



Wu 



( sup 









x.s? {ai{u) - at{u)) 



sup 



nEn 



X- (ai(u) - a*(u)) 



+ 



' sup 



Then we bound each of the three components in this display. 

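(a) To bound the first term, we observe that $\hat a_i(u) \ne a_i(u)$ only if $y_i = x_i'\hat\beta(u)$. By Lemma 7 the penalized quantile regression fit can interpolate at most $\hat s_u \le \hat s$ points with probability one uniformly over $u \in \mathcal U$. This implies that $\mathbb E_n[|\hat a_i(u) - a_i(u)|^2] \le \hat s/n$. Therefore,

$$
\begin{aligned}
\sup_{u\in\mathcal U}\big\|n\mathbb E_n[x_{i\hat T_u}(\hat a_i(u) - a_i(u))]\big\| &\le n\sup_{\|\alpha\|_0\le\hat s,\ \|\alpha\|\le1}\ \sup_{u\in\mathcal U}\mathbb E_n\big[|\alpha'x_i|\,|\hat a_i(u) - a_i(u)|\big] \\
&\le n\sup_{\|\alpha\|_0\le\hat s,\ \|\alpha\|\le1}\sqrt{\mathbb E_n[|\alpha'x_i|^2]}\ \sup_{u\in\mathcal U}\sqrt{\mathbb E_n[|\hat a_i(u) - a_i(u)|^2]} \le \sqrt{n\phi(\hat s)\hat s}.
\end{aligned}
$$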



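(b) To bound the second term, note that

$$
\begin{aligned}
\sup_{u\in\mathcal U}\big\|n\mathbb E_n[x_{i\hat T_u}(a_i(u) - a_i^*(u))]\big\| &\le \sup_{u\in\mathcal U}\big\|\sqrt n\,\mathbb G_n(x_{i\hat T_u}(a_i(u) - a_i^*(u)))\big\| + \sup_{u\in\mathcal U}\big\|nE[x_{i\hat T_u}(a_i(u) - a_i^*(u))]\big\| \\
&\le \sqrt n\,\epsilon_1(r, \hat s) + \sqrt n\,\epsilon_2(r, \hat s),
\end{aligned}
$$

where for $\psi_i(\beta, u) = (1\{y_i \le x_i'\beta\} - u)x_i$,

$$
\begin{aligned}
\epsilon_1(r, m) &:= \sup_{u\in\mathcal U,\ \beta\in R_u(r,m),\ \alpha\in S(\beta)}\big|\mathbb G_n(\alpha'\psi_i(\beta, u)) - \mathbb G_n(\alpha'\psi_i(\beta(u), u))\big|, \\
\epsilon_2(r, m) &:= \sup_{u\in\mathcal U,\ \beta\in R_u(r,m),\ \alpha\in S(\beta)}\sqrt n\,\big|E[\alpha'\psi_i(\beta, u)] - E[\alpha'\psi_i(\beta(u), u)]\big|, \text{ and} \\
R_u(r, m) &:= \{\beta \in \mathbb R^p : \beta - \beta(u) \in A_u,\ \|\beta\|_0 \le m,\ \|J_u^{1/2}(\beta - \beta(u))\| \le r\}, \\
S(\beta) &:= \{\alpha \in \mathbb R^p : \|\alpha\| \le 1,\ \mathrm{support}(\alpha) \subseteq \mathrm{support}(\beta)\}.
\end{aligned}
\tag{C.5}
$$

By Lemma 10 there is a constant $A^1_{\varepsilon/2}$ such that $\sqrt n\,\epsilon_1(r, \hat s) \le A^1_{\varepsilon/2}\sqrt{n\hat s\log(n \vee p)}\sqrt{\phi(\hat s)}$ with probability $1 - \varepsilon/2$. By Lemma 8 we have $\sqrt n\,\epsilon_2(r, \hat s) \le n(\mu(\hat s)/2)(r \wedge 1)$.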

(c) To bound the last term, by Theorem 1 there exists a constant A^^^ such that with 
probability 1 — e/2 

where we used that a|(ti) = u — l{ui < u}, i = 1, ■■■n, for ui, ...,Un i.i.d. uniform (0, 1). 



sup 

uGU 



nEn 



X 
Combining bounds in (a)-(c), using that $\min_{j=1,\dots,p}\hat\sigma_j \ge 1/2$ by condition D.3 with probability $1 - \gamma$, we have

— < Ks) j{rM) + ^TsK, ^ , 

with probability at least $1 - \varepsilon - \gamma$, for $K_\varepsilon = 2(1 + A^1_{\varepsilon/2} + A^2_{\varepsilon/2})$. □

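Next we control the linearization error $\epsilon_2$ defined in (C.5).

Lemma 8 (Controlling linearization error $\epsilon_2$). Under D.1-2

$$\epsilon_2(r, m) \le \sqrt n\sqrt{\phi(m)}\,\big\{1 \wedge (2[\bar f/\underline f^{1/2}]\,r)^{1/2}\big\} \quad\text{for all } r > 0 \text{ and } m \le n.$$

Proof. By definition

$$\epsilon_2(r, m) = \sup_{u\in\mathcal U,\ \beta\in R_u(r,m),\ \alpha\in S(\beta)}\sqrt n\,\big|E[(\alpha'x_i)(1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\})]\big|.$$

By Cauchy-Schwarz, and using that $\phi(m) = \sup_{\|\alpha\|\le1,\ \|\alpha\|_0\le m}E[|\alpha'x_i|^2]$,

$$\epsilon_2(r, m) \le \sqrt n\sqrt{\phi(m)}\sup_{u\in\mathcal U,\ \beta\in R_u(r,m)}\sqrt{E[(1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\})^2]}.$$

Then, since for any $\beta \in R_u(r, m)$, $u \in \mathcal U$,

$$
\begin{aligned}
E[(1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\})^2] &\le E[1\{|y_i - x_i'\beta(u)| \le |x_i'(\beta - \beta(u))|\}] \\
&\le E[(2\bar f|x_i'(\beta - \beta(u))|) \wedge 1] \le \{2\bar f(E[|x_i'(\beta - \beta(u))|^2])^{1/2}\} \wedge 1
\end{aligned}
$$

and $(E[|x_i'(\beta - \beta(u))|^2])^{1/2} \le \|J_u^{1/2}(\beta - \beta(u))\|/\underline f^{1/2}$ by Lemma 2, the result follows. □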

Next we proceed to control the empirical error $\epsilon_1$ defined in (C.5). We shall need the following preliminary result on the uniform $L_2$ covering numbers ([37]) of a relevant function class.

Lemma 9. (1) Consider a fixed subset $T \subset \{1,2,\dots,p\}$, $|T| = m$. The class of functions

$$\mathcal F_T = \{\alpha'(\psi_i(\beta, u) - \psi_i(\beta(u), u)) : u \in \mathcal U,\ \alpha \in S(\beta),\ \mathrm{support}(\beta) \subseteq T\}$$

has a VC index bounded by $cm$ for some universal constant $c$. (2) There are universal constants $C$ and $c$ such that for any $m \le n$ the function class

$$\mathcal F_m = \{\alpha'(\psi_i(\beta, u) - \psi_i(\beta(u), u)) : u \in \mathcal U,\ \beta \in \mathbb R^p,\ \|\beta\|_0 \le m,\ \alpha \in S(\beta)\}$$

has the uniform covering numbers bounded as

$$\sup_Q N(\varepsilon\|F_m\|_{Q,2}, \mathcal F_m, L_2(Q)) \le C\Big(\frac{16e}{\varepsilon}\Big)^{2(cm-1)}\Big(\frac{ep}{m}\Big)^m, \quad \varepsilon > 0.$$

Proof. The proof of part (1) follows by showing that the corresponding subgraph class is created by at most $K$ operations of taking unions, intersections, and complements of VC classes of sets with VC index at most $m$, and then appealing to [37] Lemma 2.6.17. We relegate the details to [4] for brevity.

To show part (2) let $\mathcal F_T$ denote a restriction of $\mathcal F_m$ for a particular choice $T$ of $m$ non-zero components. Part (1) implies $N(\varepsilon\|F_T\|_{Q,2}, \mathcal F_T, L_2(Q)) \le C(cm)(16e)^{cm}(1/\varepsilon)^{2(cm-1)}$, where $C$ is a universal constant (see [37] Theorem 2.6.7). Since we have at most $\binom{p}{m} \le (ep/m)^m$ different restrictions $T$, the total covering number is bounded according to the statement of the lemma. □
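The counting bound $\binom{p}{m} \le (ep/m)^m$ used at the end of this proof is a standard inequality and can be confirmed numerically for small cases. The following sketch is only an illustration; the function name is not from the paper.

```python
import math


def binomial_bound_holds(p, m):
    """Check the counting bound C(p, m) <= (e * p / m)^m for 1 <= m <= p."""
    return math.comb(p, m) <= (math.e * p / m) ** m


def worst_ratio(p_max):
    """Largest value of C(p, m) / (e p / m)^m over all 1 <= m <= p <= p_max."""
    worst = 0.0
    for p in range(1, p_max + 1):
        for m in range(1, p + 1):
            worst = max(worst, math.comb(p, m) / (math.e * p / m) ** m)
    return worst
```

A value of `worst_ratio(p_max)` below one confirms the bound on the whole grid of `(p, m)` pairs examined.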

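Lemma 10 (Controlling empirical error $\epsilon_1$). Under D.1-2 there exists a universal constant $A$ such that with probability $1 - \delta$

$$\epsilon_1(r, m) \le A\delta^{-1/2}\sqrt{m\log(n \vee p)}\sqrt{\phi(m)} \quad\text{uniformly for all } r > 0 \text{ and } m \le n.$$

Proof. By definition $\epsilon_1(r, m) \le \sup_{f\in\mathcal F_m}|\mathbb G_n(f)|$. From Lemma 9 the uniform covering number of $\mathcal F_m$ is bounded by $C(16e/\varepsilon)^{2(cm-1)}(ep/m)^m$. Using Lemma 17 with $\theta_m = p$ we have that uniformly in $m \le n$, with probability at least $1 - \delta$

$$\sup_{f\in\mathcal F_m}|\mathbb G_n(f)| \le A\delta^{-1/2}\sqrt{m\log(n \vee p)}\,\max\Big\{\sup_{f\in\mathcal F_m}E[f^2]^{1/2},\ \sup_{f\in\mathcal F_m}\mathbb E_n[f^2]^{1/2}\Big\}. \tag{C.7}$$

By $|\alpha'(\psi_i(\beta, u) - \psi_i(\beta(u), u))| \le |\alpha'x_i|$ and the definition of $\phi(m)$

$$\mathbb E_n[f^2] \le \mathbb E_n[|\alpha'x_i|^2] \le \phi(m) \quad\text{and}\quad E[f^2] \le E[|\alpha'x_i|^2] \le \phi(m). \tag{C.8}$$

Combining (C.8) with (C.7) we obtain the result. □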

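The next lemma provides a bound on maximum $k$-sparse eigenvalues, which we used in some of the derivations presented earlier.

Lemma 11. Let $M$ be a positive semi-definite matrix and $\phi_M(k) = \sup\{\alpha'M\alpha : \alpha \in \mathbb R^p, \|\alpha\| = 1, \|\alpha\|_0 \le k\}$. For any integers $k$ and $\ell k$ with $\ell \ge 1$, we have $\phi_M(\ell k) \le \lceil\ell\rceil\phi_M(k)$.

Proof. Let $\bar\alpha$ achieve $\phi_M(\ell k)$. Moreover let $\sum_{i=1}^{\lceil\ell\rceil}\alpha_i = \bar\alpha$ such that $\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|_0 = \|\bar\alpha\|_0$. We can choose the $\alpha_i$'s such that $\|\alpha_i\|_0 \le k$ since $\lceil\ell\rceil k \ge \ell k$. Since $M$ is positive semi-definite, for any $i, j$ we have $\alpha_i'M\alpha_i + \alpha_j'M\alpha_j \ge 2|\alpha_i'M\alpha_j|$. Therefore

$$
\begin{aligned}
\phi_M(\ell k) = \bar\alpha'M\bar\alpha &= \sum_{i=1}^{\lceil\ell\rceil}\alpha_i'M\alpha_i + \sum_{i=1}^{\lceil\ell\rceil}\sum_{j\ne i}\alpha_i'M\alpha_j \le \sum_{i=1}^{\lceil\ell\rceil}\big[\alpha_i'M\alpha_i + (\lceil\ell\rceil - 1)\alpha_i'M\alpha_i\big] \\
&\le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2\phi_M(\|\alpha_i\|_0) \le \lceil\ell\rceil\max_{i=1,\dots,\lceil\ell\rceil}\phi_M(\|\alpha_i\|_0) \le \lceil\ell\rceil\phi_M(k),
\end{aligned}
$$

where we used that $\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2 = 1$. □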
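The conclusion of Lemma 11 can be checked by brute force in the smallest non-trivial case $k = 1$, $\ell = 2$, where both sparse eigenvalues have closed forms. This is a sketch for illustration only: `random_psd` is a hypothetical helper that builds a Gram matrix, and the check covers just this special case.

```python
import math
import random


def sparse_eig_1(M):
    """phi_M(1): the largest diagonal entry of M."""
    return max(M[j][j] for j in range(len(M)))


def sparse_eig_2(M):
    """phi_M(2): largest eigenvalue over all 2x2 principal submatrices of M."""
    p = len(M)
    best = sparse_eig_1(M)
    for i in range(p):
        for j in range(i + 1, p):
            a, b, d = M[i][i], M[i][j], M[j][j]
            # closed-form top eigenvalue of the symmetric matrix [[a, b], [b, d]]
            top = 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)
            best = max(best, top)
    return best


def random_psd(p, rows, rng):
    """Gram matrix B'B of a random rows-by-p matrix B (positive semi-definite)."""
    B = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(rows)]
    return [[sum(B[r][i] * B[r][j] for r in range(rows)) for j in range(p)]
            for i in range(p)]
```

For a positive semi-definite `M`, `sparse_eig_2(M)` should not exceed `2 * sparse_eig_1(M)`, in line with $\phi_M(2k) \le 2\phi_M(k)$ at $k = 1$.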

APPENDIX D: PROOF OF THEOREM 4 

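Proof of Theorem 4. By assumption $\sup_{u\in\mathcal U}\|\hat\beta(u) - \beta(u)\|_\infty \le \sup_{u\in\mathcal U}\|\hat\beta(u) - \beta(u)\| \le r^o < \inf_{u\in\mathcal U}\min_{j\in T_u}|\beta_j(u)|$, which immediately implies the inclusion event (3.15), since the converse of this event implies $\|\hat\beta(u) - \beta(u)\|_\infty \ge \inf_{u\in\mathcal U}\min_{j\in T_u}|\beta_j(u)|$.

Consider the hard-thresholded estimator $\bar\beta(u)$ next. To establish the inclusion, we note that $\inf_{u\in\mathcal U}\min_{j\in T_u}|\hat\beta_j(u)| \ge \inf_{u\in\mathcal U}\min_{j\in T_u}\{|\beta_j(u)| - |\beta_j(u) - \hat\beta_j(u)|\} \ge \inf_{u\in\mathcal U}\min_{j\in T_u}|\beta_j(u)| - r^o > \gamma$, by assumption on $\gamma$. Therefore $\inf_{u\in\mathcal U}\min_{j\in T_u}|\hat\beta_j(u)| > \gamma$ and $\mathrm{support}(\beta(u)) \subseteq \mathrm{support}(\bar\beta(u))$ for all $u \in \mathcal U$. To establish the opposite inclusion, consider $e_n = \sup_{u\in\mathcal U}\max_{j\notin T_u}|\hat\beta_j(u)|$. By definition of $r^o$, $e_n \le r^o$ and therefore $e_n < \gamma$ by the assumption on $\gamma$. By the hard-threshold rule, all components smaller than $\gamma$ are excluded from the support of $\bar\beta(u)$, which yields $\mathrm{support}(\bar\beta(u)) \subseteq \mathrm{support}(\beta(u))$. □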
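The support-recovery argument above can be traced on a small numerical example. The sketch below assumes the hard-threshold rule keeps a component when its absolute value exceeds $\gamma$ and that $r^o < \gamma < \min_{j\in T}|\beta_j| - r^o$; the vectors and the values of `r` and `gamma` are made up for illustration.

```python
import math


def hard_threshold(beta_hat, gamma):
    """Keep components with |beta_hat_j| > gamma and set the rest to zero."""
    return [b if abs(b) > gamma else 0.0 for b in beta_hat]


def support(beta):
    """Indices of the non-zero components."""
    return {j for j, b in enumerate(beta) if b != 0.0}


# Illustrative values: true coefficients, an estimate within distance r of them,
# and a threshold gamma strictly between r and (min non-zero |beta_j|) - r.
beta_true = [1.0, 0.0, -0.8, 0.0, 0.0]
beta_hat = [1.05, 0.08, -0.72, -0.10, 0.0]
r, gamma = 0.2, 0.3

error_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta_true)))
recovered = support(hard_threshold(beta_hat, gamma))
```

Here `error_norm` is below `r`, the unthresholded estimate has four non-zero components, and `recovered` coincides with the support of `beta_true`.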

APPENDIX E: PROOF OF LEMMA 6 (USED IN THEOREM 5) 

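Proof of Lemma 6. (Sparse Identifiability and Control of Empirical Error) The proof of claim (3.17) of this lemma follows identically the proof of claim (3.7) of Lemma 2, given in Appendix B, after replacing $A_u$ with $\tilde A_u$. Next we bound the empirical error

$$\frac{|\epsilon_u(\delta)|}{\|\delta\|} = \frac{1}{\|\delta\|\sqrt n}\Big|\int_0^1\delta'\,\mathbb G_n(\psi_i(\beta(u) + \gamma\delta, u))\,d\gamma\Big| \le \frac{1}{\sqrt n}\,\epsilon_3(\hat m), \qquad \epsilon_3(\hat m) := \sup_{f\in\tilde{\mathcal F}_{\hat m}}|\mathbb G_n(f)|, \tag{E.1}$$

where the class of functions $\tilde{\mathcal F}_{\hat m}$ is defined in Lemma 12. The result follows from the bound on $\epsilon_3(\hat m)$ holding uniformly in $\hat m \le n$ given in Lemma 13. □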
Next we control the empirical error $\epsilon_3$ defined in (E.1) for $\tilde{\mathcal F}_{\hat m}$ defined below. We first bound uniform covering numbers of $\tilde{\mathcal F}_{\hat m}$.

Lemma 12. Consider a fixed subset $T \subset \{1,2,\dots,p\}$, $T_u = \mathrm{support}(\beta(u))$ such that $|T \setminus T_u| \le \hat m$ and $|T_u| \le s$ for some $u \in \mathcal U$. The class of functions

$$\mathcal F_{T,u} = \{\alpha'x_i(1\{y_i \le x_i'\beta\} - u) : \alpha \in S(\beta),\ \mathrm{support}(\beta) \subseteq T\}$$

has a VC index bounded by $c(\hat m + s) + 2$. The class of functions

$$\tilde{\mathcal F}_{\hat m} = \{\mathcal F_{T,u} : u \in \mathcal U,\ T \subset \{1,2,\dots,p\},\ |T \setminus T_u| \le \hat m\},$$

obeys, for some universal constants $C$ and $c$ and each $\varepsilon > 0$,

supiV(e||F~||Q,2,-F~,L2(Q)) < C (32e/e)^W^+^)+2) ^2™, |u^^^j.^|2s _ 
Q 

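Proof. The class $\mathcal F_{T,u}$ is a subset of $\mathcal F_T := \mathcal G_T + \mathcal H_T$, where $\mathcal G_T = \{\alpha'x_i\cdot1\{y_i \le x_i'\beta\} : \alpha \in S(\beta), \mathrm{support}(\beta) \subseteq T\}$ and $\mathcal H_T = \{-u\cdot\alpha'x_i : u \in \mathcal U, \alpha \in S(\beta), \mathrm{support}(\beta) \subseteq T\}$. The VC index of $\mathcal G_T$ and $\mathcal H_T$ is bounded by $c|T|$. Therefore the VC index of $\mathcal F_T$ is bounded by $2c|T| + 2 \le V = 2c(\hat m + s) + 2$, for every $u \in \mathcal U$, which shows the first result.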

To show the second result, we first note that the uniform covering numbers of J-'t are 
bounded by supg iV(e||FT||Q,2, ^r, ^2((5)) < C{2V){We)^^+^{l/ey^^^-^\ where C is a uni- 
versal constant (see [37] Theorem 2.6.7). We also note that J-~ is a subset of J-~ = {J-'t ■ 
T C {1,2, . . . ,p}, \T \ Tu\ < m, u £ U}. Therefore, the bound stated in the lemma now 
follows by taking the product of the bound on the uniform covering numbers above with 
the total number of different function sets J-'t, indexed by models T, that generate J--^, 
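followed by some simplifications. To bound the number of function sets, first, note that for any fixed $u \in \mathcal U$, since $|T \setminus T_u| \le \hat m$, we can pick at most $\max_{m\le\hat m}\binom{p}{m} \le p^{\hat m}$ different models $T$; second, note that by varying across $u \in \mathcal U$, we can generate at most $\sum_{k=1}^s\binom{|\cup_{u\in\mathcal U}T_u|}{k} \le \sum_{k=1}^s|\cup_{u\in\mathcal U}T_u|^k \le 2|\cup_{u\in\mathcal U}T_u|^s$ different sets $T_u$ since $s \le |\cup_{u\in\mathcal U}T_u|$. The number of sets $\mathcal F_T$ is therefore bounded by $p^{\hat m}\cdot2|\cup_{u\in\mathcal U}T_u|^s$. □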

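Lemma 13 (Controlling empirical error $\epsilon_3$). Suppose that D.1 holds and $|\cup_{u\in\mathcal U}T_u| \le n$. There exists a universal constant $A$ such that with probability at least $1 - \delta$,

$$\epsilon_3(\hat m) := \sup_{f\in\tilde{\mathcal F}_{\hat m}}|\mathbb G_n(f)| \le A\delta^{-1/2}\sqrt{(\hat m\log(n \vee p) + s\log n)\,\phi(\hat m + s)} \quad\text{for all } \hat m \le n.$$

Proof. Lemma 12 bounds the uniform covering number of $\tilde{\mathcal F}_{\hat m}$. Using Lemma 17 with $m = \hat m + s$ and $\theta_m = p^{2\hat m/(\hat m+s)}\cdot n^{2s/(\hat m+s)}$, we conclude that uniformly in $\hat m \le n$

$$\sup_{f\in\tilde{\mathcal F}_{\hat m}}|\mathbb G_n(f)| \le A\delta^{-1/2}\sqrt{m\log(n \vee \theta_m)}\cdot\max\Big\{\sup_{f\in\tilde{\mathcal F}_{\hat m}}E[f^2]^{1/2},\ \sup_{f\in\tilde{\mathcal F}_{\hat m}}\mathbb E_n[f^2]^{1/2}\Big\} \tag{E.2}$$

with probability at least $1 - \delta$. The result follows, since for any $f \in \tilde{\mathcal F}_{\hat m}$ the corresponding vector $\alpha$ obeys $\|\alpha\|_0 \le \hat m + s$, so that $\mathbb E_n[f^2] \le \mathbb E_n[|\alpha'x_i|^2] \le \phi(\hat m + s)$ and $E[f^2] \le E[|\alpha'x_i|^2] \le \phi(\hat m + s)$ by definition of $\phi(\hat m + s)$. □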

APPENDIX F: MAXIMAL INEQUALITIES FOR A COLLECTION OF EMPIRICAL PROCESSES

The main results here are Lemma 14 and Lemma 17, used in the proofs of Theorem 1 
and Theorem 3 and 5, respectively. Lemma 17 gives a maximal inequality that controls the 
empirical process uniformly over a collection of classes of functions using class-dependent 
bounds. We need this lemma because the standard maximal inequalities applied to the union 
of function classes yield a single class-independent bound that is too large for our purposes. 
We prove Lemma 17 by first stating Lemma 14, giving a bound on tail probabilities of a 
separable sub-Gaussian process, stated in terms of uniform covering numbers. Here we want 
to explicitly trace the impact of covering numbers on the tail probability, since these covering 
numbers grow rapidly under increasing parameter dimension and thus help to tighten the 
probability bound. Using the symmetrization approach, we then obtain Lemma 16, giving a 
bound on tail probabilities of a general separable empirical process, also stated in terms of 
uniform covering numbers. Finally, given a growth rate on the covering numbers, we obtain 
Lemma 17. 

Lemma 14 (Exponential Inequality for Sub-Gaussian Process). Consider any linear zero-mean separable process $\{\mathbb G(f) : f \in \mathcal F\}$, whose index set $\mathcal F$ includes zero, is equipped with an $L_2(P)$ norm, and has envelope $F$. Suppose further that the process is sub-Gaussian, namely for each $g \in \mathcal F - \mathcal F$: $P\{|\mathbb G(g)| > \eta\} \le 2\exp\big(-\tfrac12\eta^2/(D^2\|g\|_{P,2}^2)\big)$ for any $\eta > 0$, with $D$ a positive constant; and suppose that we have the following upper bound on the $L_2(P)$ covering numbers for $\mathcal F$:

$$N(\varepsilon\|F\|_{P,2}, \mathcal F, L_2(P)) \le n(\varepsilon, \mathcal F, P) \quad\text{for each } \varepsilon > 0,$$

where $n(\varepsilon, \mathcal F, P)$ is increasing in $1/\varepsilon$, and $\varepsilon\sqrt{\log n(\varepsilon, \mathcal F, P)} \to 0$ as $1/\varepsilon \to \infty$ and is decreasing in $1/\varepsilon$. Then for $K > D$, for some universal constant $c < 30$, $\rho(\mathcal F, P) := \sup_{f\in\mathcal F}\|f\|_{P,2}/\|F\|_{P,2}$;



The result of Lemma 14 is in the spirit of the Talagrand tail inequality for Gaussian processes. Our result is less sharp than Talagrand's result in the Gaussian case (by a log factor), but it applies to more general sub-Gaussian processes.

In order to prove a bound on tail probabilities of a general separable empirical process, we need to go through a symmetrization argument. Since we use a data-dependent threshold, we need an appropriate extension of the classical symmetrization lemma to allow for this. Let us call a threshold function $x : \mathbb R^n \to \mathbb R$ $k$-sub-exchangeable if for any $v, w \in \mathbb R^n$ and any vectors $\tilde v, \tilde w$ created by the pairwise exchange of the components in $v$ with components in $w$, we have that $x(\tilde v) \vee x(\tilde w) \ge [x(v) \vee x(w)]/k$. Several functions satisfy this property, in particular $x(v) = \|v\|$ with $k = \sqrt2$ and constant functions with $k = 1$. The following result generalizes the standard symmetrization lemma for probabilities (Lemma 2.3.7 of [37]) to the case of a random threshold $x$ that is sub-exchangeable.
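The claim that the Euclidean norm is $\sqrt2$-sub-exchangeable can be checked numerically. The sketch below draws random vectors and random pairwise exchanges; the function names are illustrative. The property holds because an exchange preserves $\|v\|^2 + \|w\|^2$, so the larger of the two new squared norms is at least half of the larger old one.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def exchange(v, w, swap):
    """Swap v[i] with w[i] wherever swap[i] is True (a pairwise exchange)."""
    v_new = [wi if s else vi for vi, wi, s in zip(v, w, swap)]
    w_new = [vi if s else wi for vi, wi, s in zip(v, w, swap)]
    return v_new, w_new


def sub_exchangeable_gap(trials=2000, dim=8, seed=0):
    """Smallest value of max(|v~|, |w~|) - max(|v|, |w|) / sqrt(2) over random draws."""
    rng = random.Random(seed)
    gap = float("inf")
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        w = [rng.gauss(0.0, 3.0) for _ in range(dim)]
        swap = [rng.random() < 0.5 for _ in range(dim)]
        v2, w2 = exchange(v, w, swap)
        lhs = max(norm(v2), norm(w2))
        rhs = max(norm(v), norm(w)) / math.sqrt(2.0)
        gap = min(gap, lhs - rhs)
    return gap
```

A non-negative value of `sub_exchangeable_gap()` (up to rounding) is consistent with $k = \sqrt2$ for $x(v) = \|v\|$.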

Lemma 15 (Symmetrization with Data-dependent Thresholds). Consider arbitrary independent stochastic processes $Z_1,\dots,Z_n$ and arbitrary functions $\mu_1,\dots,\mu_n : \mathcal F \to \mathbb R$. Let $x(Z) = x(Z_1,\dots,Z_n)$ be a $k$-sub-exchangeable random variable and for any $\tau \in (0,1)$ let $q_\tau$ denote the $\tau$ quantile of $x(Z)$, $\bar p_\tau := P(x(Z) \le q_\tau) \ge \tau$, and $p_\tau := P(x(Z) < q_\tau) \le \tau$. Then



i=l 



> x^y x(Z) < —P 

/ Pt 



i=l 



Xn V x(Z) \ 
> +Pr 



4k 



where xq is a constant such that infjgjpP (Er=i ^iif)\ ^ ^) ^ 1 ~ ^• 



Note that we can recover the classical symmetrization lemma for fixed thresholds by setting $k = 1$, $\bar p_\tau = 1$, and $p_\tau = 0$.

Lemma 16 (Exponential inequality for separable empirical process). Consider a separable empirical process $\mathbb G_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - E[f(Z_i)]\}$ and the empirical measure $\mathbb P_n$ for $Z_1,\dots,Z_n$, an underlying i.i.d. data sequence. Let $K > 1$ and $\tau \in (0,1)$ be constants, and $e_n(\mathcal F, \mathbb P_n) = e_n(\mathcal F, Z_1,\dots,Z_n)$ be a $k$-sub-exchangeable random variable, such that



logn(e, J',P„)de < e„(:r, 



,) and sup var^f < --{AkcKen{T ,^n)f 
for the same constant $c > 0$ as in Lemma 14, then

^'0/2 , r,,9 1 \ 

A 1 + T. 



sup |G„(/)| > ikcKeniJ',rn) \ < 



T 







Finally, our main result in this section is as follows. 

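Lemma 17 (Maximal Inequality for a Collection of Empirical Processes). Consider a collection of separable empirical processes $\mathbb G_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - E[f(Z_i)]\}$, where $Z_1,\dots,Z_n$ is an underlying i.i.d. data sequence, defined over function classes $\mathcal F_m$, $m = 1,\dots,n$, with envelopes $F_m = \sup_{f\in\mathcal F_m}|f(x)|$, $m = 1,\dots,n$, and with upper bounds on the uniform covering numbers of $\mathcal F_m$ given for all $m$ by

$$n(\varepsilon, \mathcal F_m, \mathbb P_n) = (n \vee \theta_m)^m(\omega/\varepsilon)^{\upsilon m}, \quad 0 < \varepsilon < 1,$$

with some constants $\omega > 1$, $\upsilon > 1$, and $\theta_m \ge \theta_0$. For a constant $C := (1 + \sqrt{2\upsilon})/4$ set

$$e_n(\mathcal F_m, \mathbb P_n) = C\sqrt{m\log(n \vee \theta_m \vee \omega)}\,\max\Big\{\sup_{f\in\mathcal F_m}\|f\|_{P,2},\ \sup_{f\in\mathcal F_m}\|f\|_{\mathbb P_n,2}\Big\}.$$

Then, for any $\delta \in (0, 1/6)$, and any constant $K \ge \sqrt{2/\delta}$ we have

$$\sup_{f\in\mathcal F_m}|\mathbb G_n(f)| \le 4\sqrt2\,cK\,e_n(\mathcal F_m, \mathbb P_n), \quad\text{for all } m \le n,$$

with probability at least $1 - \delta$, provided that $n \vee \theta_0 \ge 3$; the constant $c$ is the same as in Lemma 14.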

Proof of Lemma 14. The strategy of the proof is similar to the proof of Lemma 19.34 in [35], page 286, given for the expectation of a supremum of a process; here we instead bound tail probabilities and also compute all constants explicitly.

Step 1. There exists a sequence of nested partitions of $\mathcal F$, $\{(\mathcal F_{qi}, i = 1,\dots,N_q), q = q_0, q_0 + 1, \dots\}$ where the $q$-th partition consists of sets of $L_2(P)$ radius at most $\|F\|_{P,2}2^{-q}$, where $q_0$ is the largest positive integer such that $2^{-q_0} \le \rho(\mathcal F, P)/4$ so that $q_0 \ge 2$. The existence of such a partition follows from a standard argument, e.g. [35], page 286.

Let $f_{qi}$ be an arbitrary point of $\mathcal F_{qi}$. Set $\pi_q(f) = f_{qi}$ if $f \in \mathcal F_{qi}$. By separability of the process, we can replace $\mathcal F$ by $\cup_{q,i}f_{qi}$, since the supremum norm of the process can be computed by taking this set only. In this case, we can decompose $f - \pi_{q_0}(f) = \sum_{q=q_0+1}^\infty(\pi_q(f) - \pi_{q-1}(f))$. Hence by linearity $\mathbb G(f) - \mathbb G(\pi_{q_0}(f)) = \sum_{q=q_0+1}^\infty\mathbb G(\pi_q(f) - \pi_{q-1}(f))$, so that

$$P\Big\{\sup_{f\in\mathcal F}|\mathbb G(f)| > \sum_{q=q_0}^\infty\eta_q\Big\} \le \sum_{q=q_0+1}^\infty P\Big\{\max_f|\mathbb G(\pi_q(f) - \pi_{q-1}(f))| > \eta_q\Big\} + P\Big\{\max_f|\mathbb G(\pi_{q_0}(f))| > \eta_{q_0}\Big\},$$

for constants $\eta_q$ chosen below.

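Step 2. By construction of the partition sets, $\|\pi_q(f) - \pi_{q-1}(f)\|_{P,2} \leq 2\|F\|_{P,2} 2^{-(q-1)} \leq 4\|F\|_{P,2} 2^{-q}$ for $q \geq q_0 + 1$. Setting $\eta_q = 8K\|F\|_{P,2} 2^{-q} \sqrt{\log N_q}$, using sub-Gaussianity, setting $K > D$, using that $2\log N_q \geq \log N_q N_{q-1} \geq \log n_q$, using that $q \mapsto \log n_q$ is increasing in $q$, and $2^{-q_0} \leq \rho(\mathcal{F}, P)/4$, we obtain

$$
\begin{aligned}
\sum_{q=q_0+1}^{\infty} P\Big\{ \max_{f} |G(\pi_q(f) - \pi_{q-1}(f))| > \eta_q \Big\}
&\leq \sum_{q=q_0+1}^{\infty} N_q N_{q-1} \, 2\exp\Big( -\eta_q^2 / \big\{ 2 (4D\|F\|_{P,2} 2^{-q})^2 \big\} \Big) \\
&\leq \sum_{q=q_0+1}^{\infty} N_q N_{q-1} \, 2\exp\big( -(K/D)^2 \, 2\log N_q \big)
\leq \sum_{q=q_0+1}^{\infty} 2\exp\big( -\{(K/D)^2 - 1\} \log n_q \big) \\
&\leq \int_{q_0}^{\infty} 2\exp\big( -\{(K/D)^2 - 1\} \log n_q \big) \, dq
= \int_{0}^{\rho(\mathcal{F},P)/4} (x \ln 2)^{-1} \, 2 \, n(x, \mathcal{F}, P)^{-\{(K/D)^2 - 1\}} \, dx.
\end{aligned}
$$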



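By Jensen, $\sqrt{\log N_q} \leq a_q := \sum_{j=q_0}^{q} \sqrt{\log n_j}$, so that $\sum_{q=q_0+1}^{\infty} \eta_q \leq 8 \sum_{q=q_0+1}^{\infty} K\|F\|_{P,2} 2^{-q} a_q$. Letting $b_q = 2 \cdot 2^{-q}$, noting $a_{q+1} - a_q = \sqrt{\log n_{q+1}}$ and $b_{q+1} - b_q = -2^{-q}$, we get using summation by parts

$$
\begin{aligned}
\sum_{q=q_0+1}^{\infty} 2^{-q} a_q
&= -\sum_{q=q_0+1}^{\infty} (b_{q+1} - b_q) a_q
= -a_q b_q \Big|_{q_0+1}^{\infty} + \sum_{q=q_0+1}^{\infty} b_{q+1} (a_{q+1} - a_q) \\
&= 2 \cdot 2^{-(q_0+1)} \sqrt{\log n_{q_0+1}} + \sum_{q=q_0+1}^{\infty} 2 \cdot 2^{-(q+1)} \sqrt{\log n_{q+1}}
= 2 \sum_{q=q_0+1}^{\infty} 2^{-q} \sqrt{\log n_q},
\end{aligned}
$$

where we use the assumption that $2^{-q} \sqrt{\log n_q} \to 0$ as $q \to \infty$, so that $-a_q b_q \big|_{q_0+1}^{\infty} = 2 \cdot 2^{-(q_0+1)} \sqrt{\log n_{q_0+1}}$. Using that $2^{-q} \sqrt{\log n_q}$ is decreasing in $q$ by assumption,

$$2 \sum_{q=q_0+1}^{\infty} 2^{-q} \sqrt{\log n_q} \leq 2 \int_{q_0}^{\infty} 2^{-q} \sqrt{\log n(2^{-q}, \mathcal{F}, P)} \, dq.$$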
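The summation-by-parts step can be checked numerically on a truncated sum. The sketch below verifies the finite identity $\sum_{q=m}^{Q} 2^{-q} a_q = a_m b_m - a_{Q+1} b_{Q+1} + \sum_{q=m}^{Q} b_{q+1}(a_{q+1} - a_q)$ with $m = q_0 + 1$ and $b_q = 2 \cdot 2^{-q}$. The covering bound $n_q = (\omega 2^{q})^{\upsilon}$ and all parameter values are hypothetical choices made only for illustration.

```python
import math


def summation_by_parts_sides(q0=2, Q=60, omega=2.0, upsilon=2.0):
    """Return both sides of the truncated summation-by-parts identity.

    Assumes an illustrative covering bound n_q = (omega * 2**q)**upsilon,
    so that log n_q = upsilon * (log omega + q * log 2).
    """
    def log_n(q):
        return upsilon * (math.log(omega) + q * math.log(2.0))

    # a_q = sum_{j=q0}^{q} sqrt(log n_j), stored for q = q0, ..., Q + 1
    a = {}
    running = 0.0
    for q in range(q0, Q + 2):
        running += math.sqrt(log_n(q))
        a[q] = running

    def b(q):
        return 2.0 * 2.0 ** (-q)

    m = q0 + 1
    lhs = sum(2.0 ** (-q) * a[q] for q in range(m, Q + 1))
    boundary = a[m] * b(m) - a[Q + 1] * b(Q + 1)
    rhs = boundary + sum(b(q + 1) * (a[q + 1] - a[q]) for q in range(m, Q + 1))
    return lhs, rhs, a[Q + 1] * b(Q + 1)
```

The third returned value is the boundary term at $Q + 1$; under the assumption $2^{-q}\sqrt{\log n_q} \to 0$ it vanishes as $Q \to \infty$, which is what allows the infinite-sum version used in the proof.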

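Using a change of variables and that $2^{-q_0} \leq \rho(\mathcal{F}, P)/4$, we finally conclude that

$$\sum_{q=q_0+1}^{\infty} \eta_q \leq K\|F\|_{P,2} \frac{16}{\log 2} \int_{0}^{\rho(\mathcal{F},P)/4} \sqrt{\log n(x, \mathcal{F}, P)} \, dx.$$

Step 3. Letting $\eta_{q_0} = K\|F\|_{P,2} \, \rho(\mathcal{F}, P) \sqrt{2\log N_{q_0}}$, recalling that $N_{q_0} = n_{q_0}$, using that $\|\pi_{q_0}(f)\|_{P,2} \leq \|F\|_{P,2}$ and sub-Gaussianity, we conclude

$$
\begin{aligned}
P\Big\{ \max_{f} |G(\pi_{q_0}(f))| > \eta_{q_0} \Big\}
&\leq 2 n_{q_0} \exp\big( -(K/D)^2 \log n_{q_0} \big)
\leq 2\exp\big( -\{(K/D)^2 - 1\} \log n_{q_0} \big) \\
&\leq \int_{q_0-1}^{q_0} 2\exp\big( -\{(K/D)^2 - 1\} \log n_q \big) \, dq
= \int_{\rho(\mathcal{F},P)/4}^{\rho(\mathcal{F},P)/2} (x \ln 2)^{-1} \, 2 \, n(x, \mathcal{F}, P)^{-\{(K/D)^2 - 1\}} \, dx.
\end{aligned}
$$

Also, since $n_{q_0} = n(2^{-q_0}, \mathcal{F}, P)$, $2^{-q_0} \leq \rho(\mathcal{F}, P)/4$, and $n(x, \mathcal{F}, P)$ is increasing in $1/x$, we obtain $\eta_{q_0} \leq 4\sqrt{2} K\|F\|_{P,2} \int_{0}^{\rho(\mathcal{F},P)/4} \sqrt{\log n(x, \mathcal{F}, P)} \, dx$.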

Step 4. Finally, adding the bounds on tail probabilities from Steps 2 and 3 we obtain the tail bound stated in the main text. Further, adding bounds on $\eta_q$ from Steps 2 and 3, and using $c = 16/\log 2 + 4\sqrt{2} < 30$, we obtain $\sum_{q=q_0}^{\infty} \eta_q \leq cK\|F\|_{P,2} \int_{0}^{\rho(\mathcal{F},P)/4} \sqrt{\log n(x, \mathcal{F}, P)} \, dx$.

□ 
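As a quick arithmetic check of Step 4, the constant $c = 16/\log 2 + 4\sqrt{2}$ can be evaluated directly, with $\log$ the natural logarithm. The function name below is an illustrative assumption.

```python
import math


def chaining_constant():
    """c = 16/log(2) + 4*sqrt(2): the sum of the constants from Steps 2 and 3."""
    step2_constant = 16.0 / math.log(2.0)   # from the change of variables x = 2**(-q)
    step3_constant = 4.0 * math.sqrt(2.0)   # from the bound on eta_{q0}
    return step2_constant + step3_constant


c = chaining_constant()
```

The value is a little below 29, so the bounds $4 < c < 30$ used later in the proof of Lemma 17 hold with some slack.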



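Proof of Lemma 15. The proof proceeds analogously to the proof of Lemma 2.3.7 (page 112) in [37] with the necessary adjustments. Letting $q_\tau$ be the $\tau$ quantile of $x(Z)$ we have

$$P\Big\{ \Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}} > x_0 \vee x(Z) \Big\} \leq P\Big\{ x(Z) \geq q_\tau, \ \Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}} > x_0 \vee x(Z) \Big\} + P\{ x(Z) < q_\tau \}.$$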



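Next we bound the first term of the expression above. Let $Y = (Y_1, \ldots, Y_n)$ be an independent copy of $Z = (Z_1, \ldots, Z_n)$, suitably defined on a product space. Fix a realization of $Z$ such that $x(Z) \geq q_\tau$ and $\| \sum_{i=1}^n Z_i \|_{\mathcal{F}} > x_0 \vee x(Z)$. Therefore there exists $f_Z \in \mathcal{F}$ such that $| \sum_{i=1}^n Z_i(f_Z) | > x_0 \vee x(Z)$. Conditional on such a $Z$ and using the triangle inequality we have that

$$P_Y\Big\{ x(Y) \leq q_\tau, \ \Big| \sum_{i=1}^n Y_i(f_Z) \Big| \leq \frac{x_0}{2} \Big\} \leq P_Y\Big\{ \Big| \sum_{i=1}^n (Y_i - Z_i)(f_Z) \Big| > \frac{x_0 \vee x(Z) \vee x(Y)}{2} \Big\}.$$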



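By definition of $x_0$ we have $\inf_{f \in \mathcal{F}} P\{ | \sum_{i=1}^n Z_i(f) | \leq x_0/2 \} \geq 1 - p_\tau/2$. Since $P_Y\{ x(Y) \leq q_\tau \} = p_\tau$, by the Bonferroni inequality the left-hand side is bounded from below by $p_\tau - p_\tau/2 = p_\tau/2$. Therefore, over the set $\{ Z : x(Z) \geq q_\tau, \ \| \sum_{i=1}^n Z_i \|_{\mathcal{F}} > x_0 \vee x(Z) \}$ we have

$$\frac{p_\tau}{2} \leq P_Y\Big\{ \Big\| \sum_{i=1}^n (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2} \Big\}.$$

Integrating over $Z$ we obtain

$$\frac{p_\tau}{2} P\Big\{ x(Z) \geq q_\tau, \ \Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{F}} > x_0 \vee x(Z) \Big\} \leq P_Z P_Y\Big\{ \Big\| \sum_{i=1}^n (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2} \Big\}.$$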



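Let $\varepsilon_1, \ldots, \varepsilon_n$ be an independent sequence of Rademacher random variables. Given $\varepsilon_1, \ldots, \varepsilon_n$, set $(\tilde{Y}_i = Y_i, \tilde{Z}_i = Z_i)$ if $\varepsilon_i = 1$ and $(\tilde{Y}_i = Z_i, \tilde{Z}_i = Y_i)$ if $\varepsilon_i = -1$. That is, we create vectors $\tilde{Y}$ and $\tilde{Z}$ by pairwise exchanging their components; by construction, conditional on each $\varepsilon_1, \ldots, \varepsilon_n$, $(\tilde{Y}, \tilde{Z})$ has the same distribution as $(Y, Z)$. Therefore,

$$P_Z P_Y\Big\{ \Big\| \sum_{i=1}^n (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2} \Big\} = E_\varepsilon P_Z P_Y\Big\{ \Big\| \sum_{i=1}^n (\tilde{Y}_i - \tilde{Z}_i) \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(\tilde{Z}) \vee x(\tilde{Y})}{2} \Big\}.$$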



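By $x(\cdot)$ being $k$-sub-exchangeable, and since $\varepsilon_i (Y_i - Z_i) = (\tilde{Y}_i - \tilde{Z}_i)$, we have that

$$E_\varepsilon P_Z P_Y\Big\{ \Big\| \sum_{i=1}^n (\tilde{Y}_i - \tilde{Z}_i) \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(\tilde{Z}) \vee x(\tilde{Y})}{2} \Big\} \leq E_\varepsilon P_Z P_Y\Big\{ \Big\| \sum_{i=1}^n \varepsilon_i (Y_i - Z_i) \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2k} \Big\}.$$

By the triangle inequality and removing $x(Y)$ or $x(Z)$, the latter is bounded by

$$2 P\Big\{ \Big\| \sum_{i=1}^n \varepsilon_i Z_i \Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z)}{4k} \Big\}.$$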

□ 
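The pairwise exchange used in the proof of Lemma 15 can be sketched in a few lines. The helper below builds $(\tilde{Y}, \tilde{Z})$ from $(Y, Z)$ and Rademacher signs and checks the identity $\varepsilon_i (Y_i - Z_i) = \tilde{Y}_i - \tilde{Z}_i$ componentwise. Scalar Gaussian draws stand in for the stochastic processes, and the function name and sample size are illustrative assumptions.

```python
import random


def exchange(y, z, eps):
    """Swap the i-th components of y and z whenever eps[i] == -1."""
    y_tilde = [yi if e == 1 else zi for yi, zi, e in zip(y, z, eps)]
    z_tilde = [zi if e == 1 else yi for yi, zi, e in zip(y, z, eps)]
    return y_tilde, z_tilde


rng = random.Random(0)
n = 8
y = [rng.gauss(0.0, 1.0) for _ in range(n)]   # stand-in for Y_1, ..., Y_n
z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # stand-in for Z_1, ..., Z_n
eps = [rng.choice([-1, 1]) for _ in range(n)]  # Rademacher signs

y_tilde, z_tilde = exchange(y, z, eps)

# eps_i * (Y_i - Z_i) equals Y~_i - Z~_i for every i
identity_holds = all(
    e * (yi - zi) == yt - zt
    for e, yi, zi, yt, zt in zip(eps, y, z, y_tilde, z_tilde)
)
```

Because each swap only relabels an i.i.d. pair, the joint distribution of $(\tilde{Y}, \tilde{Z})$ given the signs matches that of $(Y, Z)$, which is the property the proof relies on.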



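Proof of Lemma 16. Let $\mathbb{G}_n^o(f) = n^{-1/2} \sum_{i=1}^n \{ \varepsilon_i f(Z_i) \}$ be the symmetrized empirical process, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Rademacher random variables, i.e., $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$, which are independent of $Z_1, \ldots, Z_n$. By Chebyshev's inequality and the assumption on $e_n(\mathcal{F}, \mathbb{P}_n)$ we have for the constant $\tau$ fixed in the statement of the lemma

$$P\big( |\mathbb{G}_n(f)| > 4kcK e_n(\mathcal{F}, \mathbb{P}_n) \big) \leq \frac{\sup_{f} \mathrm{var}_P(\mathbb{G}_n(f))}{(4kcK e_n(\mathcal{F}, \mathbb{P}_n))^2} \leq \frac{\sup_{f \in \mathcal{F}} \|f\|_{P,2}^2}{(4kcK e_n(\mathcal{F}, \mathbb{P}_n))^2} \leq \tau/2.$$

Therefore, by the symmetrization Lemma 15 we obtain

$$P\Big\{ \sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| > 4kcK e_n(\mathcal{F}, \mathbb{P}_n) \Big\} \leq \frac{4}{\tau} P\Big\{ \sup_{f \in \mathcal{F}} |\mathbb{G}_n^o(f)| > cK e_n(\mathcal{F}, \mathbb{P}_n) \Big\} + \tau.$$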

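We then condition on the values of $Z_1, \ldots, Z_n$, denoting the conditional probability measure as $P_\varepsilon$. Conditional on $Z_1, \ldots, Z_n$, by the Hoeffding inequality the symmetrized process $\mathbb{G}_n^o$ is sub-Gaussian for the $L_2(\mathbb{P}_n)$ norm, namely, for $g \in \mathcal{F} - \mathcal{F}$, $P_\varepsilon\{ \mathbb{G}_n^o(g) > x \} \leq 2\exp(-x^2/[2\|g\|_{\mathbb{P}_n,2}^2])$. Hence by Lemma 14 with $D = 1$, we can bound

$$P_\varepsilon\Big\{ \sup_{f \in \mathcal{F}} |\mathbb{G}_n^o(f)| \geq cK e_n(\mathcal{F}, \mathbb{P}_n) \Big\} \leq \Big[ \int_{0}^{\rho(\mathcal{F}, \mathbb{P}_n)/2} (x \ln 2)^{-1} \, 2 \, n(x, \mathcal{F}, \mathbb{P}_n)^{-\{K^2 - 1\}} \, dx \Big] \wedge 1.$$

The result follows from taking the expectation over $Z_1, \ldots, Z_n$.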



□ 
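The conditional sub-Gaussian bound invoked in the proof of Lemma 16 can be checked exactly for a small sample by enumerating all sign patterns. In the sketch below, `g` plays the role of the fixed values $g(Z_1), \ldots, g(Z_n)$; the helper names and the numerical values are illustrative assumptions.

```python
import itertools
import math


def rademacher_tail(g, x):
    """Exact P_eps{ n^{-1/2} * sum_i eps_i * g_i > x } over all 2^n sign patterns."""
    n = len(g)
    exceed = 0
    for signs in itertools.product((-1, 1), repeat=n):
        value = sum(e * gi for e, gi in zip(signs, g)) / math.sqrt(n)
        if value > x:
            exceed += 1
    return exceed / 2 ** n


def hoeffding_bound(g, x):
    """2 * exp(-x^2 / (2 * ||g||^2_{P_n,2})) with ||g||^2_{P_n,2} = mean of g_i^2."""
    norm_sq = sum(gi * gi for gi in g) / len(g)
    return 2.0 * math.exp(-x * x / (2.0 * norm_sq))


# Hypothetical fixed values g(Z_1), ..., g(Z_10)
g_values = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, -0.9, 0.4, 1.6, -0.2]
comparison = [(x, rademacher_tail(g_values, x), hoeffding_bound(g_values, x))
              for x in (0.5, 1.0, 1.5, 2.0, 3.0)]
```

Each tuple in `comparison` holds the level $x$, the exact conditional tail probability, and the Hoeffding bound, so the inequality can be inspected level by level.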



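Proof of Lemma 17. Step 1. (Main Step) In this step we prove the main result. First, we observe that the bound $\epsilon \mapsto n(\epsilon, \mathcal{F}_m, \mathbb{P}_n)$ satisfies the monotonicity hypotheses of Lemma 16 uniformly in $m \leq n$.

Second, recall $e_n(\mathcal{F}_m, \mathbb{P}_n) := C \sqrt{m \log(n \vee \theta_m \vee \omega)} \max\{ \sup_{f \in \mathcal{F}_m} \|f\|_{P,2}, \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2} \}$ for $C = (1 + \sqrt{2\upsilon})/4$. Note that $\sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2}$ is $\sqrt{2}$-sub-exchangeable and $\rho(\mathcal{F}_m, \mathbb{P}_n) := \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2} / \|F_m\|_{\mathbb{P}_n,2} \geq 1/\sqrt{n}$ by Step 2 below. Thus, uniformly in $m \leq n$:

$$
\begin{aligned}
\|F_m\|_{\mathbb{P}_n,2} \int_{0}^{\rho(\mathcal{F}_m, \mathbb{P}_n)/4} \sqrt{\log n(\epsilon, \mathcal{F}_m, \mathbb{P}_n)} \, d\epsilon
&\leq \|F_m\|_{\mathbb{P}_n,2} \int_{0}^{\rho(\mathcal{F}_m, \mathbb{P}_n)/4} \sqrt{m \log(n \vee \theta_m) + \upsilon m \log(\omega/\epsilon)} \, d\epsilon \\
&\leq (1/4) \sqrt{m \log(n \vee \theta_m)} \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2} + \|F_m\|_{\mathbb{P}_n,2} \int_{0}^{\rho(\mathcal{F}_m, \mathbb{P}_n)/4} \sqrt{\upsilon m \log(\omega/\epsilon)} \, d\epsilon \\
&\leq \sqrt{m \log(n \vee \theta_m \vee \omega)} \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2} \, (1 + \sqrt{2\upsilon})/4 \leq e_n(\mathcal{F}_m, \mathbb{P}_n),
\end{aligned}
$$

which follows by $\int_0^\rho \sqrt{\log(\omega/\epsilon)} \, d\epsilon \leq (\int_0^\rho 1 \, d\epsilon)^{1/2} (\int_0^\rho \log(\omega/\epsilon) \, d\epsilon)^{1/2} \leq \rho \sqrt{2 \log(n \vee \omega)}$, for $1/\sqrt{n} \leq \rho \leq 1$.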
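The Cauchy-Schwarz step for the entropy integral can be illustrated numerically. The sketch below approximates $\int_0^\rho \sqrt{\log(\omega/\epsilon)} \, d\epsilon$ by a midpoint rule and compares it with the intermediate bound $\rho \sqrt{1 + \log(\omega/\rho)}$, which uses the closed form $\int_0^\rho \log(\omega/\epsilon) \, d\epsilon = \rho (1 + \log(\omega/\rho))$, and with the final bound $\rho \sqrt{2 \log(n \vee \omega)}$. The values $n = 100$ and $\omega = 2$, the function names, and the step count are assumptions chosen for illustration only.

```python
import math


def entropy_integral(rho, omega, steps=100_000):
    """Midpoint-rule approximation of int_0^rho sqrt(log(omega/eps)) d eps.

    The midpoint rule never evaluates the integrand at eps = 0, where it diverges
    (the singularity is integrable).
    """
    h = rho / steps
    return h * sum(math.sqrt(math.log(omega / ((i + 0.5) * h))) for i in range(steps))


def cauchy_schwarz_bound(rho, omega):
    """(int_0^rho 1)^{1/2} * (int_0^rho log(omega/eps))^{1/2} = rho * sqrt(1 + log(omega/rho))."""
    return rho * math.sqrt(1.0 + math.log(omega / rho))


def final_bound(rho, n, omega):
    """rho * sqrt(2 * log(n v omega)), the bound quoted in the text."""
    return rho * math.sqrt(2.0 * math.log(max(n, omega)))


n_obs, omega = 100, 2.0   # illustrative values
checks = []
for rho in (1.0 / math.sqrt(n_obs), 0.5, 1.0):
    checks.append((entropy_integral(rho, omega),
                   cauchy_schwarz_bound(rho, omega),
                   final_bound(rho, n_obs, omega)))
```

For these illustrative values the three quantities are ordered as in the displayed chain at each $\rho$ in the range $1/\sqrt{n} \leq \rho \leq 1$.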

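Third, for any $K \geq \sqrt{2/\delta} > 1$ we have $(K^2 - 1) \geq 1/\delta$, and let $\tau_m = \delta/(4m \log(n \vee \theta_0))$.

Recall that $4\sqrt{2} cC > 4$, where $4 < c < 30$ is defined in Lemma 14. Note that for any $m \leq n$ and $f \in \mathcal{F}_m$ we have by Chebyshev's inequality

$$P\big( |\mathbb{G}_n(f)| > 4\sqrt{2} cK e_n(\mathcal{F}_m, \mathbb{P}_n) \big) \leq \frac{\sup_{f \in \mathcal{F}_m} \|f\|_{P,2}^2}{(4\sqrt{2} cK e_n(\mathcal{F}_m, \mathbb{P}_n))^2} \leq \frac{\delta/2}{(4\sqrt{2} cC)^2 m \log(n \vee \theta_0)} \leq \tau_m/2.$$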

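By Lemma 16 with our choice of $\tau_m$, $m \leq n$, $\omega > 1$, $\upsilon > 1$, and $\rho(\mathcal{F}_m, \mathbb{P}_n) \leq 1$,

$$
\begin{aligned}
P\Big\{ \sup_{f \in \mathcal{F}_m} |\mathbb{G}_n(f)| > 4\sqrt{2} cK e_n(\mathcal{F}_m, \mathbb{P}_n), \ \exists m \leq n \Big\}
&\leq \sum_{m=1}^{n} P\Big\{ \sup_{f \in \mathcal{F}_m} |\mathbb{G}_n(f)| > 4\sqrt{2} cK e_n(\mathcal{F}_m, \mathbb{P}_n) \Big\} \\
&\leq \sum_{m=1}^{n} \Big[ \, \cdots \, \Big] \\
&\leq \frac{(n \vee \theta_0)^{-1/\delta}}{1 - (n \vee \theta_0)^{-1/\delta}} + \frac{\delta (1 + \log n)}{4 \log(n \vee \theta_0)} \leq \delta,
\end{aligned}
$$

where the last inequality follows by $n \vee \theta_0 \geq 3$ and $\delta \in (0, 1/6)$.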
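The role of the weights $\tau_m = \delta/(4m \log(n \vee \theta_0))$ in the union bound can be illustrated numerically. The sketch below sums $\tau_m$ over $m \leq n$, compares the sum with $\delta(1 + \log n)/(4 \log(n \vee \theta_0))$ (which follows from the harmonic-sum bound $\sum_{m \leq n} 1/m \leq 1 + \log n$), and evaluates the geometric term $(n \vee \theta_0)^{-1/\delta}/(1 - (n \vee \theta_0)^{-1/\delta})$. The function name and the parameter values are illustrative assumptions.

```python
import math


def union_bound_terms(n, theta0, delta):
    """Return (sum of tau_m, harmonic-type bound, geometric term) for m = 1, ..., n."""
    base = max(n, theta0)
    log_base = math.log(base)
    tau_sum = sum(delta / (4.0 * m * log_base) for m in range(1, n + 1))
    harmonic_bound = delta * (1.0 + math.log(n)) / (4.0 * log_base)
    ratio = base ** (-1.0 / delta)          # common ratio of the geometric series
    geometric_term = ratio / (1.0 - ratio)
    return tau_sum, harmonic_bound, geometric_term


# Illustrative parameter choices satisfying n v theta0 >= 3 and delta in (0, 1/6)
examples = [union_bound_terms(50, 3.0, 0.1),
            union_bound_terms(1, 3.0, 0.16),
            union_bound_terms(1000, 1.0, 0.01)]
```

In each illustrative case the sum of the weights stays below the harmonic-type bound, and that bound plus the geometric term stays below $\delta$.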

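Step 2. (Auxiliary calculations.) To establish that $\sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2}$ is $\sqrt{2}$-sub-exchangeable, let $\tilde{Z}$ and $\tilde{Y}$ be created by exchanging any components in $Z$ with corresponding components in $Y$. Then

$$
\begin{aligned}
\sqrt{2} \Big( \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(\tilde{Z}),2} \vee \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(\tilde{Y}),2} \Big)
&\geq \Big( \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(\tilde{Z}),2}^2 + \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(\tilde{Y}),2}^2 \Big)^{1/2} \\
&\geq \Big( \sup_{f \in \mathcal{F}_m} E_n[f(\tilde{Z}_i)^2] + E_n[f(\tilde{Y}_i)^2] \Big)^{1/2}
= \Big( \sup_{f \in \mathcal{F}_m} E_n[f(Z_i)^2] + E_n[f(Y_i)^2] \Big)^{1/2} \\
&\geq \Big( \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(Z),2}^2 \vee \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(Y),2}^2 \Big)^{1/2}
= \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(Z),2} \vee \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n(Y),2}.
\end{aligned}
$$

Next we show that $\rho(\mathcal{F}_m, \mathbb{P}_n) := \sup_{f \in \mathcal{F}_m} \|f\|_{\mathbb{P}_n,2} / \|F_m\|_{\mathbb{P}_n,2} \geq 1/\sqrt{n}$ for $m \leq n$. The latter follows from $E_n[F_m^2] = E_n[\sup_{f \in \mathcal{F}_m} |f(Z_i)|^2] \leq \sup_{i \leq n} \sup_{f \in \mathcal{F}_m} |f(Z_i)|^2$, and from $\sup_{f \in \mathcal{F}_m} E_n[|f(Z_i)|^2] \geq \sup_{f \in \mathcal{F}_m} \sup_{i \leq n} |f(Z_i)|^2 / n$. □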
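Both auxiliary facts can be checked by simulation on a small finite function class. The sketch below draws samples $Z$ and $Y$, exchanges a random subset of components, and verifies the $\sqrt{2}$-sub-exchangeability inequality together with $\rho(\mathcal{F}_m, \mathbb{P}_n) \geq 1/\sqrt{n}$. The four functions in `function_class`, the Gaussian sampling, and the helper names are hypothetical choices made only for illustration.

```python
import math
import random


def sup_empirical_norm(fs, sample):
    """sup over f in fs of ||f||_{P_n,2} = sqrt(E_n[f(Z_i)^2])."""
    return max(math.sqrt(sum(f(z) ** 2 for z in sample) / len(sample)) for f in fs)


def envelope_norm(fs, sample):
    """||F||_{P_n,2} for the envelope F(z) = sup over f in fs of |f(z)|."""
    return math.sqrt(sum(max(abs(f(z)) for f in fs) ** 2 for z in sample) / len(sample))


# Hypothetical finite function class standing in for F_m
function_class = [lambda z: z, lambda z: z * z - 1.0, math.sin,
                  lambda z: 1.0 / (1.0 + z * z)]

rng = random.Random(1)
n = 20
all_ok = True
for _ in range(200):
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    swap = [rng.random() < 0.5 for _ in range(n)]
    z_tilde = [yi if s else zi for zi, yi, s in zip(z, y, swap)]
    y_tilde = [zi if s else yi for zi, yi, s in zip(z, y, swap)]

    # sqrt(2)-sub-exchangeability of the sup of empirical L2 norms
    lhs = math.sqrt(2.0) * max(sup_empirical_norm(function_class, z_tilde),
                               sup_empirical_norm(function_class, y_tilde))
    rhs = max(sup_empirical_norm(function_class, z),
              sup_empirical_norm(function_class, y))

    # rho(F_m, P_n) >= 1/sqrt(n)
    rho = sup_empirical_norm(function_class, z) / envelope_norm(function_class, z)

    all_ok = all_ok and lhs + 1e-12 >= rhs and rho + 1e-12 >= 1.0 / math.sqrt(n)
```

The flag `all_ok` records whether both inequalities held in every simulated draw; the small tolerance only guards against floating-point rounding.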
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